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Abstract 

Nested  transactions,  a  generalization  of  atomic  transactions,  provide  a  uniform  mechanism 
for  coping  with  failures  and  obtaining  concurrency  within  an  action.  Execution  of  a  nested 
action  is  synchronized  with  its  creating  action  by  halting  the  creator  until  the  execution  of  its 
nested  action  terminates. 

This  thesis  proposes  a  new  kind  of  a  nested  action  called  a  Disconnected  Action  (DA,  for 
short)  that  runs  asynchronously  with  its  creating  action.  Disconnected  actions  allow  additional 
work  to  be  done  in  parallel  with  the  rest  of  the  work  in  a  nested  atomic  action  system.  Work 
done  by  a  DA  improves  the  performance  of  its  creating  action,  but  is  not  needed  for  the 
correctness  of  this  action. 

This  thesis  describes  how  DAs  semantics  are  achieved;  existing  mechanisms,  such  as  the 
two-phase  commit,  are  modified,  and  serial  behavior  of  DAs  is  ensured  by  a  mechanism  that 
uses  timestamps  to  enforce  static  creation  order,  and  tables  to  monitor  and  control  dynamic 
serialization  order  of  concurrent  DAs.  The  thesis  also  proposes  a  technique  that  allows  an 
action  using  DAs  to  commit  in  spite  of  some  failures. 


Thesis  Supervisor:  Barbara  H.  Liskov 

Title:  N.E.C.  Professor  of  Software  Science  and  Engineering 

Keywords:  Distributed  systems,  Transactions,  Nested  transactions,  Disconnected  nested  trans¬ 
actions,  Concurrency  control. 
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Chapter  1 


Introduction 


,  '  -/  The  main  challenge  facing  the  design  of  reliable  distributed  computing  systems  is  maintaining 
data  integrity  in  the  presence  of  concurrency  and  failures.  Atomic  transactions,  or  actions*,  are 
a  widely  accepted  solution  to  these  two  problems;  they  are  serializable  and  recoverable,  thus 
hiding  concurrency  and  failures. 

1  Nested  transactions ‘'(•[Reed' 1978,  Moss  1981,  Liskov  &  Scheifler  1983])-are  a  generalization 
of  the  model  of  atomic  transactions.  Nested  transactions,  or  subactions,  provide  a  uniform 
mechanism  for  coping  with  failures  and  obtaining  concurrency  within  an  action.  Execution  of  ^  ^  ^ 

actions  in  a  nested  action  model  is  synchronized  in  a  manner  analogous  to  making  procedure 
calls  in  a  programming  language;  an  action  that  created  a  subaction  (or  a  group  of  concurrent 
subactions)  halts  until  the  execution  of  its  subaction  terminated. 

This  thesis  proposes  a  new  kind  of  a  nested  action  called  a  Disconnected  Action  (DA,  for 
short)  that,  unlike  a  regular  subaction,  runs  asynchronously  with  its  creating  action.  Discon¬ 
nected  actions  allow  additional  work  to  be  done  in  parallel  with  the  rest  of  the  work  in  a  nested 
atomic  action  system.  Work  done  by  a  DA  improves  the  performance  of  its  parent  action,  but 
is  not  needed  for  the  correctness  of  this  action. 


Disconnected  actions  can  be  used  to  perform  benevolent  side  elfects  within  the  scope  of 
their  action.  They  can  be  used  to  update  caches  or  improve  representation  of  abstract  data 
objects.  For  example,  consider  an  object  representing  a  rational  number  (a  fraction),  with  a 
pair  of  numbers  as  its  internal  representation,  for  numerator  and  denominator.  An  action  may 
create  a  subaction  to  update  the  object,  followed  by  a  DA  to  bring  the  fraction  into  canonical 
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form. 

The  rest  of  this  chapter  is  organized  as  follows:  A  motivating  example  for  the  use  of  discon¬ 
nected  actions,  a  replicated  file,  is  presented  in  detail  below.  Section  1.2  overviews  some  basic 
work  in  the  field  of  nested  atomic  actions.  The  chapter  concludes  with  an  overview  of  the  rest 
of  the  thesis. 


1.1  Replicated  File  System  -  an  example 

Assume  a  replicated  file  system  that  employs  a  simple  voting  scheme  (like  those  in  [Thomas  1979, 
Gifford  1979]).  There  are  N  sites,  each  containing  a  replica  of  the  file  and  a  version  number.  A 
file  is  updated  by  assembling  a  write  quorum  of  W  sites,  and  read  by  a  read  quorum  of  R  sites. 
In  order  to  ensure  one  copy  serializability1 ,  the  quorums  should  satisfy  two  constraints: 

W  +  R  >  N  and  2  W>N 

that  is,  the  read  quorum  and  write  quorum  should  include  at  least  one  common  site  to  ensure 
that  a  read  operation  will  see  effects  of  previous  write  operations,  and  having  at  least  one 
common  site  for  every  two  write  quorums  ensures  serial  order  of  write  operations. 


Write  Quorum 


f  SITE A 


(20m»  y 


FILE 

Version  #  1 


[SITEb 


(60m  ty 


f  SITEc 


(60mj)> 


f  SIT  Ed  (som.) 


r  sites 


V. 


Read  Quorum 

Figure  1-1:  An  example  for  a  voting  scheme. 

Figure  1-1  shows  an  example  for  a  replicated  file  system  using  the  voting  scheme  described 
above,  with  live  sites  and  read  or  write  quorums  of  three  sites  (i.e.,  N  =  5,  W  =  Z,R  =  3). 


'Meaning:  the  user  should  not  be  able  to  observe  the  replication.  The  replicated  file  should  behave  like  a 
single  copy. 
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Each  site  has  its  copy  of  the  file  and  a  version  number.  The  up-to-date  copies  have  the  highest 
version  numbers.  The  figures  in  parenthesis  indicate  the  latency  from  the  reader’s  site. 

Since  files  usually  are  large  in  size,  a  file  is  read  in  two  phases.  First,  a  read  quorum  is 
assembled,  and  the  version  numbers  of  the  files  in  this  quorum  are  read.  The  reader  can  tell, 
based  on  these  numbers,  which  sites  in  its  quorum  have  up-to-date  replicas  of  the  file.  In  the 
second  phase,  the  file  itself  is  read  from  one  of  these  up-to-date  sites,  preferably  from  a  site 
that  has  the  lowest  latency  (the  local  copy,  for  example). 

Though  the  reader  is  guaranteed  to  get  the  current  version  of  the  file,  its  choice  in  selecting 
a  site  to  read  from  may  be  limited,  and  its  preferred  site  may  not  have  an  up-to-date  version. 
In  the  above  example,  assume  that  all  the  sites  started  with  a  consistent  copy  of  the  file  with 
version  number  1,  and  a  write  operation  updated  the  files  of  a  quorum  of  sites  B,  C  and  D, 
incrementing  their  version  numbers  to  2.  A  following  read  operation,  using  a  quorum  of  sites  A, 
B  and  C,  finds  the  current  version  in  sites  B  and  C  only.  Site  A,  which  has  the  lowest  latency, 
can  not  be  used  since  its  version  is  not  up-to-date. 

We  would  like  to  have  the  writer  write  to  all  sites,  but  not  be  delayed  longer  than  necessary. 
This  can  be  accomplished  by  having  the  writer  write  to  some  write  quorum,  and  then  continue 
and  have  the  rest  of  the  replicas  updated  “in  the  background”.  The  work  done  in  the  “back¬ 
ground”  should  be  consistent  with  the  semantics  of  transactions.  That  is,  the  effects  of  the 
“background”  updates  should  disappear  if  the  updating  action  or  one  of  its  ancestors  aborts, 
and  become  permanent  when  the  topaction  commits. 

Disconnected  actions  can  serve  this  purpose.  A  writer  trying  to  update  a  replicated  object 
can  write  to  a  write  quorum  using  regular  subactions,  and  after  they  commit  create  disconnected 
actions  to  write  to  the  rest  of  the  replicas  (those  not  in  the  write  quorum).  The  writer  continues 
to  run  immediately  after  creating  those  DAs;  their  fate  has  no  effect  on  the  correctness  of  the 
update  operation. 

Better  performance  of  the  write  operation  is  also  expected  in  the  scheme  above,  because 
replicas  are  more  likely  to  be  up-to-date  in  the  scheme  that  uses  DAs  without  additional  time 
cost.  In  a  scheme  without  DAs,  which  updates  only  a  write  quorum,  some  replicas  may  be  far 
behind.  With  large  files,  when  a  write  operation  that  modifies  part  of  the  file  (e.g.,  an  append 
operation)  finds  such  an  archaic  replica  in  its  quorum,  it  has  to  do  more  work:  it  has  to  read 
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more  records  from  an  up-to-date  replica  and  write  those  records  to  the  archaic  one. 

1.2  Related  work 

The  idea  of  nested  atomic  actions  was  initially  proposed,  as  spheres  of  control,  in  [Davies  1973, 
Bjork  1973].  The  first  detailed  design  for  a  model  that  uses  nested  atomic  actions  was  developed 
by  Reed  ([Reed  1978]).  Reed  proposed  a  multi-version,  timestamp-’  ased  algorithm  to  ensure 
serialization  of  concurrent  actions.  A  locking-based  model  of  a  nested  atomic  action  system 
was  developed  by  Moss  ([Moss  1981]). 

Several  research  projects  have  designed  and  implemented  a  nested  atomic  action  system:  Ar¬ 
gus  ([Liskov  &  Scheifler  1983])  uses  Moss’s  model  with  small  modifications  (e.g.,  no  distinction 
between  lock-holding  and  lock-retaining).  The  Camelot  project  ([Spector,  et  al.  1987])  uses  a 
model  of  nested  actions  similar  to  Argus’,  but  with  several  differences  (e.g.,  a  remote  calls  does 
not  require  creating  a  new  subaction).  And  the  Clouds  project  ([Allchin  &  McKendry  1983]) 
also  uses  a  similar  model  with  some  variations  (e.g.,  using  a  top-level  commit  protocol  to  com¬ 
mit  subactions  at  any  level,  hence  subactions  are  not  as  “cheap”  as  in  Argus).  Clouds  also  uses 
different  levels  of  object  locking,  and  leaves  the  choice  of  level  to  the  programmer. 

Our  work  has  some  similarities  to  other  works  that  enhanced  the  nested  action  model  in 
some  ways.  In  [Walker  1984],  Walker  proposed  algorithms  that  piggybacked  information  on 
existing  messages  in  order  to  detect  orphaned  computations.  In  [Perl  1987],  Perl  also  uses 
similar  techniques  to  reduce  the  need  for  lock  query  messages  by  propagating  commit  and 
abort  information  with  existing  messages  (she  calls  it  eager  diffusion ). 

There  are  two  basic  techniques  for  serializing  concurrent  actions:  timestamping  ([Reed  1978, 
Aspnes,  et  al.  1988])  and  two-phase  locking  ([Eswaran,  et  al.  1976]).  Our  technique  of  adding 
timestamps  to  the  locking  based  model  can  be  considered  as  a  hybrid  serialization  technique. 
Hybrid  protocols  are  discussed  in  [Bernstein  &  Goodman  1981];  their  protocols  are  general 
methods  of  synchronization  that  handle  each  of  the  read-write  or  write-write  conflicts  with 
separate  two-phase  locking  or  timestamp  techniques.  Another  hybrid  technique  is  discussed  in 
[Weihl  1984];  it  proposes  the  use  of  (static)  crea.t, ion-order  timestamps  for  read-only  activities 
and  (dynamic)  commit-order  timestamps  for  update  activities. 

Our  technique  of  explicit  control  of  serialization  (using  constraint  tables)  relates  to  the 
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idea  of  conflict  graphs,  also  discussed  in  [Bernstein  &  Goodman  1981].  However,  the  conflict 
graphs  there  are  used  only  as  a  helping  tool  for  a  scheduler  to  improve  performance  of  a  certain 
timestamp  technique.  Our  notion  of  similarity  between  deadlock  detection  and  serializability 
of  concurrent  actions  has  also  been  noted  in  [Zhao  &  Ramamritham  1985]. 

Much  work  has  been  done  on  the  design  of  commit  protocols  for  top-level  actions.  [Gray  1978] 
describes  the  classic  two-phase  commit  protocol,  and  some  variations  like  “nested  two  phase 
commit  protocol”,  where  the  participants  are  ordered  and  every  one  communicates  with  the  next 
one.  Other  variations  include  non-blocking  protocols  ([Skeen  1981])  and  three-phase  protocols 
(like  two-phase,  but  with  an  initial  phase  that  tells  the  participant  to  “prepare  for  prepare”). 

Our  work  proposes  a  model  that  tries  to  commit  the  transaction  in  spite  of  failures.  In 
[Gifford  &  Donahue  1985],  a  different  way  is  tried  to  avoid  aborting  a  long-lived  atomic  trans¬ 
action  due  to  failures.  They  propose  breaking  the  atomic  transaction  into  independent  atomic 
actions;  thus  a  failure  causes  only  a  few  of  the  actions  to  abort,  and  they  could  be  retried. 


1.3  Road  map 

The  reminder  of  this  thesis  is  organized  as  follows: 

•  Chapter  2:  Describes  the  current  model  of  computation  and  defines  terminology  and 
notation. 

•  Chapter  3:  Describes  how  disconnected  actions  fit  into  the  current  model. 

•  Chapter  4:  Describes  how  concurrency  control  is  handled  in  a  system  using  DAo.  Two 
mechanisms  are  described,  timestamping  and  conflict  tables,  that  ensure  serialization  of 
DAs. 

•  Chapter  5:  Describes  how  the  two  phase  commit  protocol  should  be  modified  to  handle 
a  system  that  uses  DAs. 

•  Chapter  6:  Proposes  methods  to  enable  transactions  to  commit  in  spite  of  failures  by 
taking  advantage  of  the  semantics  of  DAs. 

•  Chapter  7:  Summarizes  the  work  and  proposes  future  additions. 
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Chapter  2 


The  System  Model  and  Definitions 


In  this  chapter  we  describe  our  model  of  computation,  which  is  basically  the  high-level  model 
employed  by  the  Argus  programming  language  and  system  ([Liskov  1984,  Liskov,  et  al.  1987b]). 

2.1  The  underlying  system 

We  view  the  low-level  system  as  a  distributed  collection  of  nodes  (i.e.  computers)  connected  by 
a  communication  network.  The  nodes  communicate  only  by  sending  messages  over  the  network. 
A  node  is  typically  a  processor  with  some  fast  volatile  memory  memory  and  slower,  non-volatile, 
stable1  storage. 

Any  component  of  the  system  may  fail,  but  the  likelihood  of  multiple  simultaneous  failures 
is  small.  Messages  sent  over  the  network  may  be  lost,  corrupted  or  duplicated.  We  assume 
that  an  underlying  Datagram  protocol  (e.g.,  [Postel  1980,  Boggs,  et  al.  1979])  ensures  delivery 
of  uncorrupted  messages.  When  that  Datagram  protocol  fails  (i.e.,  network  partition ),  the 
high-level  system  has  to  be  notified. 

Nodes  may  fail,  but  we  assume  that  they  eventually  recover.  When  a  node  fails  (i.e., 
crashes ),  the  contents  of  its  fast  volatile  memory  are  lost,  but  its  stable  storage  remains  intact. 
We  assume  fail-stop  processors  ([Schlichting  &  Schneider  1983])  in  the  nodes;  that  is,  a  failed 
processor  does  not  send  random  messages  or  write  arbitrarily  to  its  storage. 

‘Some  researchers  make  a  distinction  between  non-volatile  storage  (e.g.,  single  local  disk)  and  stable  storage 
(e.g.,  double-disk  scheme  (  [Lampson  &  Sturgis  1976]  ),  but  in  this  thesis  we  shall  treat  the  two  as  one. 
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2.2  The  current  model 


This  section  describes  the  model  of  computation  that  is  used  currently  by  the  Argus  system 
([Liskov,  et  al.  1987b]),  without  any  changes  or  optimizations  (e.g.,  as  suggested  in  [Perl  1987]). 
Some  small  modifications  are  introduced,  however,  both  to  the  current  implementation  and 
the  terminology  to  make  it  easier  to  extand  the  model  to  support  our  new  algorithms  and 
protocols.  When  the  algorithms  are  not  straightforward  ,  we  describe  them  using  a  Clu-like 
notation  ([Liskov  &  Guttag  1986]). 

2.2.1  Sites 

A  site,  or  guardian  in  Argus,  is  an  abstraction  for  a  node2.  The  high-level  system  is  a  collection 
of  sites,  each  encapsulating  several  resources.  The  site  keeps  its  state  in  internal  data  objects, 
not  accessible  from  the  outside;  the  r>nly  way  to  manipulate  these  objects  from  the  outside  (e.g., 
by  other  sites)  is  through  handler  cu  ?,  similar  to  the  way  operations  manipulate  the  internal 
representation  of  an  abstract  data  object. 

Processes  ( threads )  are  created  inside  the  site  to  carry  out  handler  calls  and  background 
activity.  Threads  may  share  and  manipulate  the  site’s  data  objects  directly.  Each  thread  has 
its  own  execution  stack,  but  compound  data  objects  are  allocated  from  a  heap.  All  threads  are 
lost  when  their  site  crashes. 

A  site  has  two  kinds  of  objects:  stable  objects  and  volatile  objects.  Stable  objects  are  written 
to  stable  storage  after  being  modified  by  top-level  actions  (see  below);  therefore  stable  objects 
survive  site  crashes  with  high  probability.  Volatile  objects  are  lost  when  their  site  crashes; 
they  are  used  for  recording  redundant  information  (e.g.,  indices)  or  information  that  can  be 
discarded  after  a  site  crash  (e.g.,  internal  state  of  a  thread).  Objects  are  handled  internally  at 
site;  no  reference  to  an  object  can  be  sent  to  another  site  (at  the  system  level). 

2.2.2  Actions 

Atomic  actions  are  a  mechanism  for  maintaining  data  consistency  in  the  presence  of  concurrency 
and  failures.  An  atomic  action  transfers  the  state  of  the  system  from  one  consistent  state  (i.e., 

2Though  several  sites  may  exist  at  a  single  node. 
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a  state  that  satisfies  some  invariants)  to  another  consistent  state.  Atomic  actions  (actions,  for 
short)  were  originally  used  in  centralized  systems,  but  are  particularly  important  in  distributed 
systems,  where  concurrency  is  real3  and  failures  may  be  partial.  Distributed  programs  may  run 
simultaneously,  each  at  several  sites,  share  data  objects  and  have  to  cope  with  failures  such  as 
site  crash  and  network  partition.  Actions  are  the  basic  programming  tool  to  build  distributed 
programs  that  cope  with  concurrency  and  failures. 

Actions  are  serializable  and  recoverable  (total).  Serializability  means  that  when  actions  are 
executed  concurrently,  the  effect  is  as  if  they  were  run  sequentially  in  some  order.  This  feature 
allows  programmers  using  actions  to  ignore  concurrency  to  a  great  extent4.  Recoverability 
means  that  an  action  either  completes  successfully  or  is  guaranteed  to  have  no  effect  (i.e.,  “all 
or  nothing”  semantics).  An  actions  that  completes  is  said  to  commit ;  otherwise,  the  action 
aborts.  ( Terminate  is  a  general  term  for  either  commit  or  abort.)  The  recoverability  feature 
serves  as  an  automatic  checkpoint  mechanism  for  programmers,  saving  the  trouble  of  writing  a 
“rollback”  code  to  undo  work  of  a  program  that  can  not  complete  due  to  failures. 

Atomicity  is  implemented  in  our  model  with  the  use  of  atom.ic  objects.  This  mechanism 
synchronizes  accesses  of  actions  to  shared  objects  to  provide  serializability,  and  provides  a  way 
to  recover  the  old  state  of  any  objects  modified  by  an  action  that  aborts. 

2.2.3  Nested  Actions 

Our  model  generalizes  atomic  actions  to  be  nested.  A  nested  action  is  an  action  that  is  started 
inside  another  action,  and  may  itself  start  more  nested  actions,  forming  an  action  tree  of  arbi¬ 
trary  levels.  Nested  actions  provide  action  semantics  within  an  action;  they  act  as  a  checkpoint 
mechanism  within  an  action,  and  also  provide  concurrency  within  an  action. 

We  use  the  following  terminology  for  actions  in  our  model:  a  subaction  is  a  nested  action. 
A  topaction  is  an  action  that  may  have  subactions,  but  is  not  itself  a  subaction  (i.e.,  the  top  of 
the  action  tree).  An  action  is  any  single  action  (i.e.,  either  subaction  or  topaction).  The  term 
transaction,  commonly  used  in  systems  without  nested  actions  instead  of  the  term  action ,  is 
used  here  to  refer  to  the  “whole  action  tree”,  instead  of  a  specific  action  in  the  tree.  For  example, 

3 Unlike  timesharing  centralized  systems  that  use  global  data  for  synchronization. 

4Some  concurrency  related  problems,  such  as  deadlocks,  may  still  need  a  programmer’s  consideration. 


we  may  talk  about  an  object  modification  done  by  the  transaction,  instead  of  specifying  exactly 
which  subaction  (or  topaction)  did  it. 

Tree  terminology  is  used  to  refer  to  relations  between  actions.  An  action  A  creating  a  nested 
action  B  is  a  parent,  while  B  is  a  child.  We  also  use  the  term  descendant,  meaning  any  of  the 
action  itself,  its  children,  their  children,  and  so  on.  For  example,  A,  B  and  any  child  C  of  B 
are  all  descendants  of  A.  Similarly  the  term  ancestor  is  used  in  the  other  direction  (e.g.,  A,B 
and  C  are  all  C's  ancestors).  The  prefix  proper  is  used  to  exclude  the  action  itself.  For  example, 
only  B  and  C  are  A’s  proper  descendants,  and  only  A  and  B  are  C’s  proper  ancestors.  The 
term  relative5  is  used  to  relate  one  action  to  another;  both  must  be  from  the  same  transaction. 
The  term  least  common  ancestor  (or  LCA)  of  two  actions,  A  and  B,  refers  to  the  lowest  action 
in  the  tree  that  is  an  ancestor  of  both  A  and  B. 

An  action  may  create  one  nested  action  at  a  time,  or  a  set  of  subactions  that  run  concur¬ 
rently.  In  both  cases  the  action  is  synchronized  with  its  children  by  being  blocked  until  the 
active  children  terminate.  Children  created  by  the  same  action  are  related  to  each  other  as 
siblings. 

Subactions  that  belong  to  the  same  set  of  concurrent  subactions  are  related  to  one  another 
as  concurrent  siblings.  Subactions  from  different  sets  (of  the  same  parent)  are  not  concurrent 
siblings.  We  use  the  term  relation  to  refer  to  one  of  these  two  subaction  creation  orderings, 
serial  or  concurrent.  Unlike  some  previous  works,  this  thesis  does  not  regard  concurrency  as 
an  inherent  feature  of  the  action  but  as  its  (ordering)  relation  to  another  relative  action6. 
Furthermore,  we  found  it  useful  to  think  about  (the  implementation  of)  a  set  of  concurrent 
actions  as  a  single  special  subaction  ( concurrent-set  action)  that  does  no  real  work  (like  the 
call-action  described  below),  but  only  creates  a  set  of  subactions  that  run  concurrently  and 
commits  after  all  the  concurrent  subactions  in  its  set  have  terminated. 

Siblings  that  are  not  concurrent  are  referred  to  as  serial  (or  sequential )  siblings.  When 
discussing  serial  siblings,  we  may  use  the  terms  later  or  earlier  (prior)  to  refer  to  the  order  in 
which  they  were  created  (and  ran).  We  use  the  terms  serially  related  and  concurrently  related 
for  relative  actions  A  and  B  to  note  the  relation  between  their  ancestors,  A'  and  B' ,  that  are 

sThink  about  relative  actions  as  “cousins”. 

6Thc  distinction  is  not  important  in  the  current  model,  which  does  not  allow  descendants  of  subactions  in 
different  sets  of  concurrent  subactions  (of  the  same  parent)  to  be  active  simultaneously. 
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serial  or  concurrent  siblings. 

The  commit  of  a  subaction  is  always  relative  to  its  parent.  When  an  action  aborts,  all  the 
effects  created  by  its  (committed)  descendants  are  undone.  When  a  subaction  S  and  all  its 
ancestors  up  to  the  topaction  committ,  we  say  that  S  has  committed  to  the  top. 

Handler  Calls 

An  action  runs  at  a  single  site  only.  When  an  action  A  wants  to  invoke  a  handler  call  to 
another  site,  another  action  H  has  to  run  there.  This  way  A  is  isolated  from  failures  of  #’s  site 
or  the  network.  In  our  model,  when  A  at  site  Ga  tries  to  use  a  handler  on  site  Gh,  a  special 
subaction  C  called  a  call-action  is  created  at  Ga  i-o  carry  out  the  call  to  Gh,  where  a  handler 
call  subaction  (H)  is  created.  The  call-action  C  does  no  real  work,  but  enables  A  to  abort  its 
handler  call  K  by  aborting  C  locally,  with  no  need  to  delay  until  IPs  site  is  notified;  such  a 
delay  can  be  very  long  when  the  network  partitions.  Note  that  C  always  commits  to  A,  even 
if  if  aborted.  For  simplicity,  we  often  ignore  call- actions  in  the  rest  of  this  thesis  when  their 
existence  has  no  effect  on  our  algorithms. 

Nested  Topactions 

Our  model  also  includes  nested  topactions.  As  their  name  suggests,  these  are  topactions  created 
inside  actions.  Unlike  a  normal  subaction,  the  commit  of  a  nested  topaction  is  independent  of 
the  commit  of  its  parent  (and  the  parent’s  transaction);  effects  created  by  a  nested  topaction 
become  permanent  upon  its  commit.  The  only  difference  between  a  nested  topaction  and  a 
non-nested  one  is  that  the  former  was  started  from  within  some  action.  Nested  topactions 
are  used  to  perform  benevolent  side-effects,  such  as  rearranging  data  objects  to  exhibit  better 
performance  (e.g.,  sort  a  list),  without  changing  their  observed  values. 

Action  Identifiers 

Action  identifiers  ( AIDs )  are  used  to  name  actions.  The  AID  is  a  data  structure  that  contains 
all  the  information  describing  the  action’s  location  in  its  action  tree  and  its  site.  The  AID 
used  in  this  thesis  is  slightly  different  from  the  original  one,  including  a  change  to  observe 
the  distinction  between  concurrent  siblings  and  siblings  of  different  concurrent  sets  (discussed 


18 


above)  and  accommodating  new  features  (like  marking  an  action  as  disconnected).  Given  the 
AIDs  of  two  actions  of  the  same  transaction,  we  can  determine  whether  they  are  serially  or 
concurrently  related. 

Our  AID  is  implemented  as  a  list  of  tagged  numerators;  a  possible  implementation  is  given 
in  Appendix  A.  The  list  describes  the  action’s  ancestry;  an  action’s  AID  is  basically  its  parent’s 
AID  concatenated  with  a  tagged  numerator.  The  tag  identifies  the  action  (e.g.,  S  for  subaction, 
G  for  a  new  site  (handler  call),  C  for  concurrent-set  action  and  T  for  topaction),  and  the 
numerator  differentiates  among  similar  siblings  and  indicates  serial  ordering  relation  between 
serial  siblings  (except  for  handler  calls,  where  the  numerator  indicates  the  site,  and  uniqueness 
and  serial  ordering  is  provided  by  the  call-action’s  numerator). 

Actions  in  examples  in  this  thesis  are  given  names  with  a  format  similar  to  their  AID.  For 
example,  G5.T84.S3.C2.T3.Sl.G6  is  the  name  of  a  handler  call  made  to  sice  6  by  the  call- 
action  1  of  the  nested  topaction  Gr5.2’84.53.G2.T3,  which  belongs  to  the  concurrent  set  2  of 
the  third  subaction  of  topaction  84,  within  site  5.  For  simplicity,  we  often  use  a  simple  short 
form,  ignoring  call-actions  and  new  sites.  For  example,  A. 2  marks  the  second  subaction  of  the 
action  named  A,  or  A.C3.5  marks  the  fifth  subaction  in  the  third  concurrent  set  of  A,  created 
serially  after  A. 2  (subactions  and  concurrent-set  actions  use  the  same  counter  to  generate  their 
numerators). 

Implementation  of  actions 

An  action  is  implemented  inside  a  site  as  a  data  structure  ( AINFO )  that  hold  information  about 
the  action.  This  information  includes  the  action’s  AID,  list  of  objects  used  by  the  action,  list 
of  aborted  descendants  of  the  action  ( alist )  and  a  list  of  sites  used  by  the  action’s  descendants 
( plist ). 

The  site  keeps  two  lists  of  actions,  ACTIVE  and  COMMITTED.  The  ACTIVE  list  contains 
AINFOs  of  all  locel  actions  that  are  currently  active.  When  an  active  action  A  terminates,  its 
entry  is  removed  from  ACTIVE.  If  A  aborted,  its  entry  is  discarded  (and  A’s  AID  is  added  to 
its  parent’s  alist).  If  A  committed,  there  are  three  cases: 

•  A  was  a  topaction:  See  Section  2.2.7  below. 

•  A  was  a  handler  call:  A’s  entry  is  placed  in  COMMITTED. 
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•  If  none  of  the  above  then  A  must  have  a  parent  P  in  ACTIVE :  A’s  alist,  plist  and  the 
list  of  objects  are  merged  into  P’s. 

Both  ACTIVE  and  COMMITTED  are  volatile  objects.  When  a  site  crashes,  the  contents 
of  both  lists  are  lost,  which  has  the  effect  of  aborting  the  actions  (that  belong  to  active  trans¬ 
actions)  that  ran  at  that  site. 

2.2.4  Atomic  Objects 

The  use  of  atomic  objects  by  a  set  of  actions  ensures  their  atomic  semantics;  these  actions  would 
be  serializable  and  recoverable.  While  our  model  allows  for  non-atomic  objects  to  exist7,  we 
restrict  this  thesis  to  atomic  objects  only;  therefore  whenever  objects  are  mentioned,  wo  mean 
atomic  objects. 

In  our  model,  atomic  objects  synchronize  the  actions  that  access  them  by  the  use  of  locks. 
Every  operation  on  an  object  is  classified  as  a  read  or  write,  and  an  appropriate  lock  (read-lock 
o.:  write-lock)  must  be  obtained  (automatically,  by  our  system)  before  the  operation  can  take 
place8.  Strict  two- phase  locking  ([Gray,  et  al.  1976])  is  employed,  locks  are  held  until  their 
action  commits  or  aborts. 

The  rules  for  acquiring  locks  are  derived  from  conflicts  between  operations:  A  write  oper¬ 
ation  conflicts  with  any  other  write  or  read  operation.  Without  nesting,  the  rules  are  simple: 
many  concurrent  readers  are  allowed,  but  when  an  action  holds  a  write-lock  on  an  object,  no 
other  lock  can  be  granted  on  that  object.  With  nesting,  the  rules  are  extended:  An  action  A 
can  acquire  a  read-lock  if  and  only  if  all  holders  of  write-lock  are  A’s  ancestors,  and  can  acquire 
a  write-lock  if  and  only  if  all  holders  of  read  or  write  locks  are  A’s  ancestors. 

We  implement  the  locks  on  an  object  as  an  ordered  stack  of  write-locks,  and  unordered 
set  of  read-locks9.  The  way  the  lock  acquisition  rules  are  applied  is  described  in  Figure  2-1 
for  a  read-lock  request,  and  in  Figure  2-2  for  a  write-lock  request10.  Note  that  we  introduce 
additional  requirements  into  the  current  model: 

7For  example,  non-atomic  objects  are  needed  in  order  to  build  user-defined  atomic  types  (see  [Weihl  1984]). 

“Note  that  models  exist  where  locking  takes  place  on  a  higher  level,  like  the  type-specific  locking  in 
[Schwarz  1984]. 

sThe  current  implementation  of  Argus  keeps  all  locks  together,  which  is  a  semantically  confusing. 

10Note  that  a  Can.n0t.have.l0Ck  reply  implies  that  a  query  (explained  below)  has  to  be  sent. 
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checkJocks.forjead  =  proc(Reader:ainfo,  OBJ:object)  returns(reply) 

%  Called  at  a  site  to  check  whether  Reader  can  get  a  READ  lock  on  OBJ 
if  versions.stack$non.empty{OBJ.write.stack)  then 

top.writeJock :  lock  :=  versions-Stack$topJock(OBJ.write_stack) 
if  aid$non.descendant(Reader.aid,top.writeJock.aid)  then 

return(can.notJiaveJock(Reader,OBJ,top-writeJock.aid)) 

end 

end 

%  Reader  is  a  descendant  of  the  top  write  locker  (if  any) 
if  lock.set$member(Reader.aid1OBJ. read-set)  then 
return(already.hasJock(Reader,OBJ)) 

else 

return(can.haveJock(Reader,OBJ)) 

end 

end  checkJockS-for-read 


Figure  2-1:  The  lock  acquisition  rules  applied  for  a  reader 

•  The  lock  acquisition  rules  should  be  satisfied  on  every  access  of  an  action  to  an  object, 
regardless  if  the  action  already  has  the  lock.  This  is  a  result  of  allowing  a  parent  to  run 
concurrently  with  its  descendants  in  our  new  model,  enabling  descendants  to  “snatch” 
locks  from  objects  used  by  their  active  ancestors  ([Xu  &  Liskov  1988]). 

»  A  read-lock  is  required  for  reading  even  if  the  action  has  a  write-lock.  The  reason  has  to 
do  with  updating  timestamps  (see  Section  4.2). 

Atomic  objects  enable  a  site  to  undo  modifications  done  by  an  action  (when  the  action  is 
aborted)  by  using  versions.  The  state  of  an  unlocked  object  is  stored  in  a  base-version ,  and 
new  versions  are  kept  in  a  stack  fashion.  When  an  action  A  acquires  a  write-lock,  a  new  version 
is  created  for  A  with  an  initial  value  of  the  previous  top  version  (or  the  base- version  if  no  other 
version  exists).  The  action  modifies  its  version  only,  and  if  the  action  aborts,  its  version  is 
discarded.  Note  that  write-locks  and  versions  are  tightly  coupled;  hence  we  may  use  those  two 
terms  interchangeably11. 

Note  again  that  versions  on  an  object  are  kept  as  a  stack  with  the  following  invariant  holding: 
Actions  holding  versions  on  some  object  below  some  version  V  are  all  proper  ancestors  of  the 

11  Write-lock  and  its  corresponding  version  are  also  implemented  as  one  object. 
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checkJocks.for.write  =  proc(Writer;ainfo,  OBJ:object)  returns(reply) 

%  Called  at  a  site  to  check  whether  Writer  can  get  a  WRITE  lock  on  OBJ 
for  readJock:lock  in  lock.set$elements(OBJ.readjset)  do 

if  aid$non.descendant(Writer.aid,read  Jock.aid)  then 

return(can.notiiaveJock(Writer,OBJ,readJock.aid)) 

end 

end 

if  versions^tack$non.empty{OBJ.writejstack)  then 

top.writeJock :  lock  :=  versions_stack$topJock(OBJ.write-Stack) 
if  aid$non.descendant(Writer.aid,top-writeJock.aid)  then 

return(can.notJiaveJock(Writer,OBJ,top-writeJock.aid)) 
elseif  Writer.aid  =  top.wriie.lock.aid  then 
return(already.has.lock(Writer,OBJ)) 
end 
end 

%  Writer  is  a  descendant  of  the  top  write  locker  (if  any) 

return(can.have.lock(Writer,OBJ)) 

end  checkJocks-for.write 


Figure  2-2:  T  tack  acquisition  rules  applied  for  a  writer 

holder  of  V.  Actions  holding  versions  below  the  top  'ersion  can  not  access  the  object  (they  are 
blocked,  in  the  current  model),  and  the  action  holding  the  top-version  (and  its  descendants) 
get  the  top-version’s  value  whenever  they  read  the  object. 

When  a  subaction  commits,  its  locks  and  versions  are  propagated  to  its  parent.  The  commit 
of  a  topaction  is  handled  differently  as  described  in  Section  2.2.7  below.  Propagation  means 
that  the  action’s  locks  and  versions  become  the  parent’s,  and  if  the  parent  had  locks  or  versions 
(on  the  same  objects)  before,  they  are  replaced  with  the  new  ones.  All  the  versions  in  the  stack 
above  the  parent’s  new  version  are  discarded  (so  the  parent’s  version  becomes  the  top  one). 
When  a  subaction  (or  topaction)  aborts,  its  locks  and  versions  are  discarded. 

While  base-versions  are  kept  in  stable  storage,  locks  and  versions  are  volatile.  Therefore 
when  a  site  crashes,  not  only  are  all  the  actions  that  exist  there  (of  active  transactions)  aborted, 
but  all  their  locks  and  versions  are  lost. 

Lock  and  version  propagation  is  done  in  a  “lazy”  fashion  when  the  parent  of  the  committing 
action  lives  on  another  site.  Locks  and  versions  of  a  handler  call  H  are  not  modified  to  become 
the  parent’s  when  H  commits.  Also  when  an  action  is  aborted,  abort  messages  are  not  guar- 
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anteed  to  arrive  at  all  the  sites  used  by  the  action’s  descendants;  therefore  some  objects  may 
appear  locally  to  be  locked  even  though  their  current  holder  has  aborted.  Only  when  another 
action  A  tries  to  acquire  a  lock  on  some  object  0 ,  locked  in  conflicting  mode  by  H  (e.g.,  A 
tries  to  write  0,  which  is  read-locked  by  H),  then  a  series  of  queries  is  initiated  in  an  effort 
to  propagate  the  lock  up  the  action  tree  to  the  LCA  of  A  and  H.  Once  the  lock  becomes  the 
LCA’s,  A  can  acquire  it. 

Queries  are  implemented  as  follows:  The  object  O’s  site  (Go)  sends  repeated  messages  to 
the  site  of  the  LCA  of  A  and  H  ( Glca )•  Glca  can  tell  (by  examining  Glca'5  ACTIVE  and 
COMMITTED  lists)  whether: 

1.  H  has  committed  up  to  the  LCA. 

2.  Some  ancestor  of  II  has  aborted. 

3.  Some  ancestor  of  H  is  (possibly)  still  active  elsewhere. 

In  the  first  two  cases,  Go  can  do  the  local  processing  needed  to  grant  a  lock  on  0  to  A  (provided 
no  other  conflicting  locks  exist  on  0).  In  the  third  case,  Go  has  to  repeat  querying  Glca-  A 
possible  optimization  in  this  case  is  to  query  sites  of  ancestors  of  H  that  are  descendants  of  the 
LCA;  if  one  of  these  aborts  then  Go  needs  not  wait  for  a  reply  from  Glca- 

2.2.5  Graphic  Representation 

We  often  use  graphic  representations  of  action  trees  in  examples  given  in  this  thesis.  Figure  2-3 
demonstrates  such  an  action  tree.  The  following  conventions  are  used: 

•  Ovals  represent  sites  (e.g.,  Ga)- 

•  Ellipsis  represent  actions  (e.g.,  A.l). 

•  Rectangular  stacks  represent  objects  (e.g.,  0\ )  with  their  versions  (e.g.,  /Li’s  version). 

•  Direct  arrows  mark  the  relation  parent  — *■  child. 

•  A  curved  arrow  marks  serial  order  between  sibling  subactions. 

•  A  pair  of  parallel  curves  marks  concurrent  sibling  subactions. 
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Figure  2-3:  An  example  for  an  action  tree. 


•  Absence  of  any  of  the  two  previous  marks  means  that  the  relation  between  the  siblings  is 
not  specified. 

For  simplicity,  we  often  ignore  some  irrelevant  details  in  our  pictures.  For  example,  call- 
actions  are  often  ignored  and  sometimes  guardians  or  objects  are  not  drawn  when  we  focus  on 
things  like  the  relation  (serial  or  concurrent)  between  actions  in  the  action  tree. 

2.2.6  Orphans 

An  orphan  is  a  computation  that  is  active  even  though  its  results  are  no  longer  needed.  For 
example,  suppose  an  action  (e.g.,  A  in  Figure  2-3)  aborts  its  handler  call  (e.g.,  A. 2)  due  to  a 
network  partition.  The  handler  call  and  its  descendants  may  still  be  active  after  being  aborted 
since  the  abort  message  has  not  arrived  at  their  site  (Ga.2  in  our  example).  We  call  such  an 
action  (A. 2)  or  any  of  its  descendants  abort  orphan,  since  it  is  a  result  of  an  abort. 
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Another  kind  of  orphan  is  the  crash  orphan.  A  crash  orphan  is  an  action  that  depends  on 
the  volatile  state  of  a  site  that  was  lost  when  that  site  crashed.  For  example,  suppose  that  A.l 
(in  Figure  2-3)  created  a  handler  call  A.1.1,  within  the  site  Gc  and  committed  back  to  A.l.  If 
Gc  crashes  while  A.l  is  still  active,  then  A.l’s  results  would  become  invalid12.  If  Gc  crashes 
after  A.l  had  committed,  but  with  A  still  not  committed,  then  A  and  all  its  descendants  are 
considered  crash  orphans. 

Orphans  are  undesirable  because  they  waste  resources  (taking  processor  time  and  prevent¬ 
ing  other  actions  from  using  objects  locked  by  the  orphans)  and  may  observe  inconsistent 
information,  possibly  resulting  in  unpredictable  behavior. 

Algorithms  to  detect  both  kinds  of  orphans  ([Liskov,  et  al.  1987a])  exist  in  the  current 
model.  The  method  to  detect  abort  orphans  works  by  keeping  track  of  all  aborted  actions  that 
may  had  active  descendants.  It  is  implemented  by  keeping  the  AIDs  of  aborted  actions  in  a  data 
structure  called  done  at  each  site  G  (as  G.done).  Updates  (i.e.,  newly  aborted  actions)  are  prop 
agated  to  other  sites  by  piggybacking  the  local  done  on  (almost)  every  message  M  as  M.don  - 
Sites  abort  all  descendants  of  actions  in  their  done.  A  proof  exists  that  the  algorithm  to  detect 
abort  orphans  prevents  these  orphans  from  seeing  inconsistent  states  ([Herlihy,  et  al.  1987]). 

Crash  orphans  are  detected  by  maintaining,  at  each  site,  a  crashcounter  that  is  incremented 
each  time  a  site  recovered  from  a  crash,  and  a  map  that  reflects  the  site’s  current  knowledge 
about  the  values  of  the  crashcounts  of  other  sites.  Each  action  A  maintains  a  data  structure 
called  the  dlist,  which  contains  the  names  of  all  the  sites  on  which  A  depends ,  i.e.,  whose  crash 
would  make  A  a  crash  orphan.  The  algorithm  works  by  having  messages  propagate  information 
(M.map)  to  sites,  and  “catching”  orphans,  when  a  message  M  arrives  at  a  site  G,  in  one  of  two 
ways: 

•  If  M  is  a  call  or  reply,  and  its  dlist  contains  a  site  C  such  that  G.map(C )  >  M.map(C), 
then  if  the  message  is  a  call,  the  caller  is  an  orphan  (and  no  handler  action  is  created), 
else  the  action  replied  to  is  an  orphan  (since  it  depends  on  its  handler  call)  and  is  aborted. 

•  If  some  site  C  has  to  be  updated  in  G.map  (i.e.,  G.map(C )  <  M.map(C )),  then  any  local 
action  that  depends  on  C  is  aborted. 

I2More  precisely,  this  requires  A.l.l  to  have  locks  in  Gc  or  to  have  made  handler  calls  to  other  sites  (and 
acquire  locks  there)  and  so  on  before  Gc  crashed,  but  this  is  most  often  the  case. 
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Actions’  dlists  are  managed  as  follows:  A  topaction’s  dlist  initially  contains  the  local  site 
only.  A  subaction’s  d’ist  is  initially  its  parent’s  dlist  plus  its  own  site.  When  a  subaction  aborts, 
its  dlist  is  discarded;  when  it  commits,  its  dlist  is  merged  (els  mathematical  set  union)  into  its 
parent’s  dlist.  Finally,  when  a  descendant  inherits  a  lock  from  an  ancestor,  the  ancestor’s  dlist 
(as  of  the  time  the  ancestor  acquired  the  lock)  is  merged  into  the  descendant’s. 

The  reEtson  for  adding  ancestors’  dlists  to  the  dlist  of  the  action  that  inherited  a  lock 
is  concurrency.  Without  concurrency,  only  a  single  point  is  active  in  the  action  tree  at  any 
time  (before  the  topaction  commits),  and  this  point  traverses  the  tree  in  a  depth  first  order, 
so  pEissing  the  dlist  at  creation  and  commit  times  is  enough  to  ensure  that  an  active  action 
always  has  a  correct  dlist.  With  concurrency,  a  concurrent  subaction  S?  may  acquire  a  lock 
on  an  object  that  was  modified  by  a  concurrent  sibling  Si,  thus  making  £2  depend  on  Si’s 
commit,  and  requiring  adding  SVs  dlist  to  S2’s-  The  implementation  simplifies  the  process 
by  taking,  instead  of  Si’s  dlist,  the  dlist  of  LC A(Si,  S2),  which  includes  the  dlist  of  Si  plus 
possibly  dlists  of  other  committed  concurrent  siblings.  The  implementation  is  as  follows13: 
When  a  site  propagates  write-locks  of  some  subaction  S  to  some  ancestor  P  of  S,  P’s  current 
dlist  is  associated  with  each  propagated  version.  When  P  is  not  local,  a  query  is  sent  and  the 
reply  “S  committed  up  to  P”  must  be  received  before  that  propagation  happens.  That  reply 
message  contains  P’s  current  dlist.  Independently  from  the  query  and  propagation  process, 
when  an  action  acquires  a  lock  on  an  object  that  was  modified  by  its  transaction,  all  the  dlists 
associated  with  the  versions  on  that  object  must  be  merged  into  the  action’s  dlist.  Note  that 
associating  dlists  with  versions  allows  us  to  remove  dlists  when  actions  abort  and  their  versions 
are  discarded. 

2.2.7  Committing  Transactions 

Topactions  commit  in  our  model  by  executing  a  top-level  commit  protocol  known  as  Two  Phase 
Commit  ([Gray  1978]).  This  protocol  ensures  that  the  transaction  either  commits  everywhere  or 
aborts  everywhere.  The  participants  in  the  protocol  are  the  sites  in  the  committing  topaction’s 
plist  (see  Section  2.2.3  above);  the  coordinator  is  the  topaction’s  site14. 

13This  process  is  not  clearly  defined  in  [Walker  1984,  Liskov,  et  al.  1987a,  Nguyen  1988].  The  description  here 
is  our  understanding  of  the  way  it  should  be  done. 

14Note  that  the  topaction’s  site  is  also  a  participant. 
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make  a 
decision 


- »-  time 

Figure  2-4:  A  time-space  diagram  of  the  current  Two  Phase  Commit  protocol. 


The  protocol  is  depicted  in  the  time-space  diagram  in  Figure  2-4;  time  elapses  from  left 
to  right,  and  separate  sites  have  separate  horizontal  lines;  the  slanted  arrows  show  messages 
sent  between  the  coordinator  and  a  participant.  Though  only  one  participant  is  shown  in 
the  diagram,  all  the  participants  receive  and  send  similar  messages  in  the  same  time  interval. 
Note  that  our  diagram  shows  only  the  case  where  the  protocol  succeeds;  the  case  where  the 
transaction  is  aborted  is  explained  below. 

The  coordinator  begins  the  first  phase  by  sending  a  PREPARE  message  to  all  the  partici¬ 
pants.  The  message  is  accompanied  by  the  committing  topaction’s  alist.  Each  participant  that 
received  the  PREPARE  message  prepares  locally  by  recording  on  stable  storage  a  tentative 
version  for  every  object  that  was  modified  by  the  transaction.  The  participant  is  considered 
prepared  after  all  the  versions  have  been  recorded  and  a  PREPARE  record  have  been  written  to 
stable  storage,  since  this  allows  it  to  recover  from  a  subsequent  crash  and  continue  participating. 

Since  our  model  propagates  locks  in  a  lazy  fashion,  a  preparing  participant  may  find  several 
versions  stacked  on  an  object  to  be  prepared.  The  participant  can  determine  the  correct  version 
to  record  by  using  the  alist  that  came  with  the  PREPARE  message.  The  correct  version  to 
prepare  is  the  top-most  version  whose  owning  action  A  is  not  in  the  alist  (i.e.,  A  committed  to 
the  top). 

The  participant  that  prepared  successfully  replies  with  OI(  to  the  coordinator;  a  participant 
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that  is  unable  to  prepare  (e.g.,  because  it  has  crashed  and  lost  all  the  relevant  information) 
replies  with  REFUSED.  After  replying,  the  (prepared)  participant  can  release  all  the  read-locks 
held  by  the  transaction. 

The  coordinator  decides  on  the  transaction’s  fate  based  on  the  replies  from  the  participants. 
If  all  replied  with  OK,  the  coordinator  commits  the  transaction  by  writing  a  COMMITTED 
record  to  its  stable  storage.  If  some  participant  responded  with  REFUSED,  or  did  not  respond 
within  a  predetermined  period  (timeout),  the  coordinator  has  to  abort  the  transaction. 

In  the  second  phase,  the  coordinator  notifies  the  participants  of  its  decision.  If  it  decided 
to  commit,  it  sends  a  COMMIT  message  to  the  participants,  which  record  the  commit  on 
stable  storage,  replace  the  base  versions  of  the  transaction’s  objects  with  the  tentative  versions, 
release  the  remaining  locks  of  the  transaction  and  reply  with  DONE.  If  the  coordinator  decides 
to  abort  the  transaction,  it  sends  an  ABORT  message,  and  the  participants  discard  all  the 
transaction’s  locks  and  tentative  versions. 

A  participant  P  that  did  not  receive  the  coordinator’s  decision  (e.g.,  because  P  crashed 
before  the  second  phase)  may  query  the  coordinator  later  for  its  decision,  which  must  be  kept 
by  the  coordinator’s  site  (unless  all  the  participants  replied  with  DONE). 
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Chapter  3 


Disconnected  Actions 


The  purpose  of  this  thesis  is  to  extend  our  current  model  of  computation  for  a  nested  atomic 
action  system,  as  presented  in  Chapter  2,  to  accommodate  disconnected  actions.  This  chapter 
describes  how  DAs  fit  into  the  model,  as  opposed  to  regular  (i.e.  not  disconnected)  subactions. 

This  chapter  is  organized  as  follows:  We  begin  by  describing  DAs  and  the  role  that  they  fill 
in  a  nested  action  system  and  their  semantics.  Next  we  describe  the  design,  terminology  and 
implementation  of  DAs  in  our  model.  And  the  last  two  sections  describe  the  necessary  changes 
to  the  query  and  orphan  detection  mechanisms  to  support  DAs. 

3.1  What  are  disconnected  actions? 

Nested  actions  enable  programs  that  use  atomic  actions  to  exploit  parallelism  by  providing  a 
tool  to  performs  several  tasks  concurrently  within  an  action.  However,  there  are  cases  where 
nested  actions  can  not  take  advantage  of  potential  parallelism,  and  independent  tasks  have  to 
be  done  sequentially.  For  example,  many  abstract  data  objects  provide  operations  that  perform 
some  abstract  modification  to  the  object;  the  implementation  of  these  operations  often  needs 
to  do  more  work  internally  on  the  object’s  representation  to  save  some  work  for  the  subsequent 
operations.  The  additional  improvement  work  can  be  done  in  parallel  with  the  work  that  the 
caller  (i.e.,  the  action  that  called  the  operation)  does  after  the  operation.  Doing  these  two  tasks 
in  parallel  is  not  possible  in  our  current  model  of  nested  actions. 

Take  the  Fibonacci  Heap  ([Fredman  &  Tarjan  1984])  as  an  example  for  an  abstract  data 
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object.  With  n  elements  in  the  heap,  the  common  operation  of  extract-minimum  takes  a  time 
of  O(lgn),  while  read-minimum  is  brief  (0(1)).  The  difference  in  time  results  from  the  work 
that  has  to  be  done  to  rearrange  the  heap  after  extraction  of  the  minimum.  A  program  that  has 
to  extract  the  minimum  can  be  speeded-up  by  reading  the  minimum  first  (and  marking  it  as 
obsolete),  and  doing  the  extraction  work  “in  the  background”,  in  parallel  with  the  continuing 
program.  To  use  such  a  speed-up  technique  in  the  current  model,  the  program  needs  to  do  the 
following: 


begin  topaction  %  A 

begin  action  %  A.1 

min  :=  fibonacci-heap$read.min.and.markJt() 
coenter  %  create  a  set  of  concurrent  subactions 

action  fibonacci-heap$finish_extracting_minimum() 
action  do.resLof-work  %  of  A.1 

end 

end  %A.1 
do_rest  %  of  A 

end  %A 

The  above  method  has  the  following  drawbacks:  Violation  of  abstraction  (the  caller  needs  to 
know  about  the  two  operations),  awkward  programming  (squeezing  the  real  body  of  the  action 
A.1  into  a  concurrent  subaction),  and  limited  scope  (the  operation  finish.oxtractingjninimum  can 
not  run  in  parallel  with  the  rest  of  the  transaction,  i.e.,  A). 

Cases  that  require  “background”  work,  as  described  above,  can  be  better  handled  when 
the  current  model  is  extended  to  have  Disconnected  Actions  (DAs).  A  DA  is  a  subaction  that 
executes  in  parallel  with  the  rest  of  the  transaction,  and  performs  benevolent  side-effects  within 
the  parent  action.  An  action  A  that  creates  a  DA  D  need  not  be  blocked  until  D  terminates  (as 
is  the  case  with  creating  regular  subactions);  A  continues  to  run  in  parallel  with  its  disconnected 
child  D.  The  disconnected  action  D  is  executed  as  a  subaction,  but  asynchronously  with  its 
transaction;  that  is,  its  creator  A  can  run  and  commit  (assuming  A  is  not  a  topaction)  to 
A’s  creator,  and  so  on,  without  any  effect  on  D.  The  only  event  that  synchronizes  D  with 
its  transaction  is  the  termination  of  D’s  topaction;  no  (non-orphan)  DA  is  left  active  after  its 
topaction  terminates. 

With  the  use  of  DAs,  the  previous  example  can  result  in  a  cleaner  program  and  the  imple¬ 
mentation  of  the  abstract  operation  can  hide  the  “background”  work: 
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extractjnin  =  handler  ()  returns  (element) 

min  :  element fibonaccLheap$read-min_and.markJt() 
disconnected-action  fibonacci-heap$finish.extracting.minimum() 
return  (min) 
end 

and  the  program  can  use  the  operation  in  a  simple  manner: 


begin  topaction  %  A 

begin  action  %  A.  1 

min  :=  fibonacci-heap$extract-min() 
do  .rest  jof.work  %  of  A.  1 

end  %  A.  1 

do-rest  %  of  A.  (Note  that  the  DA  may  still  be  active!) 
end  %  A  (including  the  DA) 


3.1.1  Kinds  of  Disconnected  actions 

A  disconnected  action  can  be  used  by  a  programmer  as  a  single  operation  that  creates  some 
benevolent  side-effects  (like  improving  the  representation  of  an  object,  as  described  above). 

Another  use  of  DAs  by  programmers  is  for  quorum-sets  that  perform  successfully  at  least  a 
quorum  M  of  operations  out  of  a  set  of  N  (such  that  M  <  N).  For  example,  quorum-sets  can 
be  used  to  update  replicated  objects,  as  described  in  Section  1.1.  The  programmer  has  to  use 
a  statement  like: 

do  M  of  {OP\  ,OP2f  •  •  »OPn} 

and  handle  an  exception  when  fewer  than  M  actions  (operations)  can  commit. 

3.1.2  Semantics  of  Disconnected  Actions 

The  commit  of  a  disconnected  action,  like  that  of  any  subaction,  is  relative  to  its  parent;  if  the 
parent  (or  any  of  its  ancestors)  aborts,  all  the  effects  done  by  the  DA  are  undone. 

The  DA  is  terminated  by  the  time  its  topaction  terminates.  Since  DAs  run  asynchronously 
with  the  rest  of  the  transaction,  the  commit  protocol  of  the  topaction  must  be  modified  to 
ensure  termination  of  DAs  of  that  transaction.  These  changes  are  discussed  in  Chapter  5. 

The  fate  (i.e.,  commit  or  abort)  of  a  DA  does  not  effect  the  commit  decision  of  any  of  its 
proper  ancestors  (except  the  case  of  quorum-set  DAs  discussed  below).  Therefore,  DAs  can  not 
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be  used  for  doing  “real”  work,  only  for  benevolent  side-effects. 

Disconnected  actions  must  have  the  same  termination  semantics  as  regular  subactions;  a 
disconnected  action  must  appear,  to  the  rest  of  the  transaction,  to  be  terminated  as  soon  as  its 
creator  resumes  its  execution.  Since  DAs  are  active  after  their  creator  continues  (unlike  regular 
subactions),  changes  must  be  made  to  the  current  model  to  ensure  serialization  of  DAs;  these 
changes  are  discussed  in  the  Chapter  4. 


3.2  Adding  Disconnected  Actions  to  the  Current  Model 

This  section  describes  issues  related  to  design,  terminology  and  implementation  of  our  system 
model  with  disconnected  actions. 

A  goal  in  the  design  of  the  new  model  was  to  minimize  the  change  to  the  design  of  the  current 
model  and  its  performance.  This  goal  was  achieved;  when  DAs  are  not  used,  the  new  model 
does  not  send  any  extra  messages,  does  very  little  additional  local  processing,  and  requires  little 
additional  storage. 

The  term  top  DA  is  used  for  a  disconnected  action  that  was  created  by  a  regular  action1 . 
A  top  DA  can  run  and  create  subactions  like  any  other  action.  The  term  disconnected  action 
(DA)  is  used  in  this  thesis  to  refer  to  any  of  the  top  DA  or  its  descendants. 

Our  model  of  disconnected  actions  can  be  generalized  to  have  nested  disconnected  actions ; 
that  is,  disconnected  actions  created  by  disconnected  actions.  Allowing  disconnected  actions 
to  be  created  recursively  can  be  useful,  for  example,  when  DAs  are  used  to  implement  abstract 
services,  which  in  turn  may  be  used  by  DAs. 

Most  of  the  work  in  this  thesis  is  concerned  with  non-nested  DAs.  Nested  DAs  are  discussed 
in  various  level  of  detail;  in  most  cases  they  seem  to  require  only  a  straightforward  extension 
of  the  work  for  non-nested  DAs. 

All  the  quorum  operations  are  run  as  DAs2.  The  creator  of  the  set  of  ( N )  concurrent 
operations  is  blocked  until  at  least  a  quorum  of  ( M )  disconnected  subactions  commit.  This 
way  there  is  no  need  to  specify  a  priori  which  operations  are  in  the  quorum,  and  the  rest  get  an 

'The  term  top  DA  is  also  used  for  a  nested  DA  created  by  a  DA  and  so  on. 

2Unlike  the  simplistic  description  in  Section  1.1,  where  the  quorum  was  achieved  by  regular  subactions  first, 
and  DAs  were  then  made  to  finish  up. 
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Figure  3-1:  An  example  of  a  Disconnected  Action 


early  start.  Note  that  since  the  creator  is  blocked  while  those  DAs  in  the  quorum  are  active, 
their  chance  of  success  is  equivalent  to  that  of  regular  subactions  as  they  do  not  conflict  with 
later  activity  (see  Chapter  4). 

3.2.1  Graphic  Representation  of  DAs 

Figure  3-1  shows  an  example  of  an  action  A  that  created  a  disconnected  action  A.D  1.  We 
use  a  dashed  oval  to  represent  a  disconnected  action  and  a  dashed  arrow  to  mark  the  relation 
parent_action  —  -  -+  child_top_DA.  The  DA  (A.D  1)  can  create  subactions  like  any  other 
action,  therefore  we  use  the  continuous  arrows  to  mark  relations  between  a  disconnected  action 
and  its  children  (e.g.  between  A.D  1  and  A.Dl.l).  Dashed  arrows  are  used  only  for  marking  the 
“disconnected  branch”  in  the  action  tree.  Though  the  “disconnection”  always  happens  inside 
a  site,  we  may  simplify  some  descriptions  and  draw  dashed  arrows  across  site  boundaries  (e.g., 
in  Figure  6-1). 
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3.2.2  Implementation  of  Disconnected  Actions 

The  top  disconnected  action  is  implemented  as  two  actions,  a  parent  and  a  child.  The  parent 
does  no  real  work  (similar  to  a  call-action  or  a  concurrent- set  action);  it  only  creates  a  child 
and  commits  after  the  child  terminated,  The  child  does  the  work  like  any  regular  subaction. 
The  reason  for  breaking  the  top  DA  in  two  is  that  some  record  must  be  kept  at  the  site  when 
the  top  DA  aborts;  since  the  parent  always  commits,  it  will  be  entered  in  COMMITTED ,  and 
its  child  will  be  in  its  alist  if  the  child  aborts. 

Putting  the  DA’s  entry  in  COMMITTED  provides  a  simple  way  to  keep  separately  infor¬ 
mation  about  that  DA,  including  its  plist  (needed  by  the  algorithm  in  Chapter  6),  its  dlist 
(needed  by  the  modified  orphan  detection  scheme  in  Section  3.4)  and  its  alist  (needed  by  the 
modified  Two  Phase  Commit  protocol  in  Chapter  5). 

Concurrent-set  actions,  of  only  explanatory  value  in  the  current  model,  need  to  be  imple¬ 
mented  in  the  model  with  DAs.  They  should  be  stored  in  COMMITTED  after  committing 
when  they  are  used  for  quorum  sets  because  they  maintain  data  structures  that  are  used  by 
active  quorum  DAs  and  needed  for  our  fault-tolerant  commit  method  (described  in  Chapter  6). 

Like  call-actions,  concurrent-set  actions  and  the  distinction  between  parent  and  child  for 
top  DAs  are  often  ignored  in  our  examples  and  discussions. 

Quorum  sets  are  implemented  as  follows:  The  concurrent-set  action  C,  which  creates  a 
set  of  concurrent  quorum  DAs,  maintains  a  data  structure  to  monitor  the  current  state  of  its 
quorum  DAs.  C’s  parent  blocks  when  C  is  created,  then  C  creates  its  concurrent  set  of  N  DAs 
and  blocks  until  a  minimal  quorum  of  Q  DAs  has  committed,  at  which  point  C  commits  and  its 
parent  continues.  If  a  quorum  can  not  be  achieved  (i.e.,  at  least  N  -  M  +  1  DAs  aborted),  C 
aborts.  Whether  C  is  active  or  committed,  its  cite  keeps  monitoring  the  state  of  its  DAs  until 
they  all  terminate;  this  information,  kept  with  C's  entry  (in  COMMITTED ),  is  needed  later 
by  the  Fault  Tolerant  Two  Phase  Commit  method  (see  Chapter  6). 

3.3  Queries 

To  use  disconnected  actions,  the  current  lock  query  mechanism  must  be  changed.  In  the  current 
model,  when  some  action  S  needs  a  lock  on  an  object  that  is  locked  in  a  conflicting  mode  by  an 
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action  X ,  queries  must  be  sent  to  the  site  of  LCA(S,  X )  to  find  out  if  X  committed  all  the  way 
to  the  LCA.  Queries  can  be  sent  to  sites  of  ancestors  of  X  (that  are  descendants  of  the  LCA) 
as  an  optimization  to  detect  a  possible  abort  of  an  ancestor  of  X  earlier. 

When  a  lock  holder  is  a  disconnected  action,  queries  to  the  LCA  are  not  sufficient  to 
determine  that  the  lock  holder  committed  to  the  LCA.  An  ancestor  P,  disconnected  from  the 
locker  X,  may  commit  to  the  LCA  while  X  is  still  active.  Only  when  the  top  DA  ancestor  of  X 
commits  (and  P  also  commits  to  the  LCA(S,  L))  can  X’s  site  propagate  X’ s  locks  and  versions 
to  LCA(S,L).  To  generalize  for  the  case  of  nested  DAs,  lock  propagation  from  a  DA  X  to  an 
ancestor  A  requires  that  all  top  DAs  between  X  and  the  A  commit,  in  addition  to  the  current 
requirement  that  an  ancestor  of  L  that  is  a  child  of  A  must  commit.  As  before,  if  any  ancestor 
of  X  was  found  to  be  aborted,  X’s  locks  and  versions  are  discarded. 

To  grant  a  lock  on  an  object  locked  (in  conflicting  mode)  by  a  DA  D,  queries  must  be  sent 
to  the  LCA  and  to  all  the  sites  where  top  DAs,  ancestors  of  X|  that  are  descendants  of  the 
LCA,  were  created.  Figure  3-2  depicts  an  example  of  an  action  A. 2  that  tries  to  get  a  lock  on 
the  object  0  that  is  (write)  locked  by  the  nested  DA  G1.A.G2.GZ.D1.G4.G5.D1.G6.GO .  The 
thick  arrows  mark  the  queries  that  Go  must  send.  Go  deduces  from  the  locker’s  AID  that  it 
has  to  send  queries  to  Gi,  the  site  of  A  (LCA  of  A. 2  and  the  locker),  to  G3>  the  site  of  the  top 
DA  Gl-A.G2.GZ.D1  and  to  G$,  the  site  of  the  nested  top  DA  G1.A.G2.GZ.D1.G4.G5.D1.  All 
the  queried  sites  must  reply  that  the  relevant  action  committed  before  Go  can  grant  the  lock 
(and  the  top  version)  on  0  to  A. 2;  if  some  site  informs  that  the  relevant  action  aborted,  the 
locker’s  locks  and  versions  should  be  discarded. 

The  implementation  of  the  modified  query  mechanism  uses  the  locker’s  AID  to  deduce 
the  list  of  sites  and  ancestors  (of  the  locker)  for  the  queries  (the  needed  operation,  “loca- 
tions.of-disconnected.ancestors”,  is  described  in  Appendix  A).  The  site  that  initiates  the 
queries  traverses  the  AID  of  the  disconnected  locker  in  a  bottom-up  order,  propagating  the 
ownership  of  locks  and  versions  up  the  tree;  only  when  a  reply  contains  information  about  the 
commit  of  a  top  DA  D  can  the  the  ownership  pass  to  D' s  parent. 
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Figure  3-2:  An  example  for  a  site  making  queries  about  the  fate  of  a  lock  owner. 
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3.4  Interaction  with  the  Orphan  Detection  mechanism 


The  orphan  detection  algorithms  are  distributed  algorithms  that  propagate  knowledge  among 
sites  piggybacked  on  messages  sent  by  the  transactions.  The  crash  orphan  detection  algorithm 
also  associates  information  with  the  actions  themselves. 

The  only  difference  between  the  system  model  that  uses  DAs  and  the  current  one,  from 
orphan  detection  point  of  view,  is  that  some  information  is  not  passed  from  a  committing  top 
DA  to  its  parent,  unlike  the  case  between  any  other  action  and  its  parent. 

The  abort  orphan  detection  algorithm  associates  no  knowledge  with  specific  action-  and 
does  not  pass  any  information  from  a  committed  action  to  its  parent.  Therefore  this  algorithm 
should  work  in  a  system  model  with  DAs  as  before. 

The  crash  orphan  detection  algorithm  propagates  knowledge  among  sites  in  a  manner  similar 
to  the  algorithm  for  abort  orphans,  but  in  addition  associates  dlists  with  actions.  Updates  to 
dlists  travel  along  branches  of  the  action  tree  (with  subaction’s  creation  and  commit)  and  with 
(“committed”)  replies  to  lock  queries  to  the  LCA. 

In  a  model  with  DAs,  the  dlist  of  a  top  DA  D  is  not  merged  into  the  dlist  of  its  parent 
when  D  commits  (because  the  parent,  and  several  other  ancestors,  may  have  committed  by 
that  time).  This  poses  no  problem  for  an  ancestor  P  of  a  DA  D  that  does  not  belong  to  a 
quorum,  because  P  can  commit  even  if  some  participant  of  D  crashes,  as  explained  in  Chapter  6. 
However,  when  some  relative  R  of  D  observes  D’s  effects,  it  becomes  dependent  on  D’s  commit 
(and  dlist).  If  R  is  a  regular  subaction  and  later  commits  to  P,  P  would  also  depend  then  on 
D’s  dlist. 

An  action  observing  effects  of  a  relative  DA  can  get  the  DA’s  dlist  using  the  extended  query 
mechanism  (Section  3.3).  Basically  replies  to  queries  sent  to  sites  of  top  DA  should  also  include 
the  committed  top  DA’s  dlist.  As  can  be  seen  in  the  example  of  Figure  3-2,  the  site  with  the 
object  (Go)  sends  queries  to  G$  and  G$,  in  addition  to  the  LCA’s  site  G\.  The  combination 
of  the  dlists  of  the  three  replies  would  cover  the  dlist  of  the  version  holder  that  committed  to 
the  LCA. 

A  problem  does  exist  with  DAs  that  are  used  for  quorum  sets.  An  ancestor  of  a  quorum 
set  depends  on  the  commit  of  at  least  a  quorum  of  its  descendant  DAs.  Three  solutions  are 
proposed  here: 
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•  Do  nothing.  Some  crash  orphans  may  not  be  detected. 

•  Pass  the  dlists  of  the  DAs  that  committed  as  part  of  the  quorum  to  the  concurrent-set 
action.  Some  non-orphans  may  be  detected  as  crash  orphans. 

•  Pass  more  information  with  the  dlists  of  committed  DAs  to  the  concurrent-set  action. 
When  an  action  is  detected  as  an  orphan  because  some  site  used  by  a  DA  of  the  quorum 
crashed,  the  site  that  created  the  concurrent- set  action  must  be  queried  to  know  if  the 
quorum  minimum  was  violated. 

The  last  solution  is  too  complex,  and  requires  a  change  to  the  orphan  detection  algorithm. 
With  the  use  of  the  fault  tolerant  commit  algorithm  discussed  in  Chapter  6,  quorums  sets  done 
using  DAs  are  likely  to  survive  crashes,  so  the  first  solution  is  favored;  without  that  algorithm, 
the  second  solution  can  be  used. 
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Chapter  4 


Serialization  of  Disconnected 
Actions 


In  this  chapter  we  describe  the  way  our  new  model,  which  supports  disconnected  actions,  is 
extended  to  ensure  serialization  of  atomic  actions.  In  the  current  model,  actions  were  serialized 
by  obeying  a  strict  two-phase  locking  protocol;  disconnected  actions,  which  run  asynchronously 
with  their  parents,  may  violate  that  protocol.  This  chapter  presents  additional  rules  and  mech¬ 
anisms  that  enforce  serial  behavior  in  our  new  model. 

One  atomic  action  is  serialized  after  another  if  they  access  atomic  objects  in  conflicting 
mode;  that  is,  at  least  one  of  the  two  updates  the  object.  In  this  chapter,  whenever  we  talk 
about  an  access  to  an  object,  we  mean  access  in  conflicting  mode. 

An  important  point  in  our  serialization  mechanism  is  the  ordering  relation  between  the  two 
relative  actions  (i.e.,  of  the  same  transaction)  that  access  an  atomic  object,  the  one  that  got  a 
lock  on  the  object  and  the  other  that  tries  to  get  a  conflicting  lock.  The  two  can  be  serially 
related,  which  means  that  their  LCA  created  their  ancestors  to  run  in  a  predetermined  order,  or 
concurrent  relatives  that  run  concurrently  and  get  serialized  by  the  use  of  atomic  objects.  We 
distinguish  between  the  case  of  “serial  DAs”,  which  are  DAs  related  serially  to  other  actions, 
and  “concurrent  DAs”,  which  are  concurrently  related  to  other  actions,  as  we  describe  our 
solutions  that  handle  the  two  cases  separately  . 

We  devised  two  serialization  mechanisms,  one  for  each  ordering  relation.  In  the  serial  case, 
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where  predetermined  information  about  the  order  of  the  related  action  exists,  we  use  a  simple 
timestamp  mechanism  to  enforce  this  order.  For  the  concurrent  case,  where  the  ordering  is  set 
dynamically  in  various  locations,  we  use  a  mechanism  of  constraint  tables  to  centrally  monitor 
the  ordering  and  prevent  disconnected  actions  from  violating  this  order.  The  constraint  tables 
mechanism  also  requires  introduction  of  a  new  kind  of  lock,  the  fingerprint  lock,  to  maintain 
information  about  access  history  to  objects. 

We  give  only  intuitive  reasoning  for  the  correct  functionality  of  our  new  rules  and  mecha¬ 
nisms;  no  formal  correctness  proof  is  given.  Our  belief  in  the  correctness  of  the  new  mechanisms 
is  also  supported  by  a  successful  simulation  we  did  of  the  new  model. 

The  chapter  is  organized  as  follows:  Section  4.1  presents  an  example  for  a  non-serializable 
execution  as  a  result  of  using  DAs.  Section  4.2  handles  only  the  case  of  serial  DAs  by  introducing 
the  timestamp  mechanism  and  additional  lock  acquisition  rules  that  use  timestamps.  Section  4.3 
completes  the  work  by  introducing  the  constraint  tables  mechanism  to  handle  concurrent  DAs 
and  presenting  the  full  scheme  for  lock  propagation  and  acquisition  in  our  new  model.  We 
conclude  in  Section  4.4. 


4.1  The  serialization  problem 

The  following  simplified  example  demonstrates  the  serialization  consistency  problem.  Suppose 
we  have  a  replicated  counter  C,  composed  of  three  replicas  Ci,C2  and  C3  (represented  as 
(Valuei,  Value2,  Valuer)  ).  Read  and  write  operations  are  performed  on  the  counter  using  a 
simple  majority  scheme,  similar  to  our  initial  example  (Section  1.1):  Read  from  two  replicas; 
write  to  two  replicas.  The  write  operation  is  augmented  by  a  disconnected  action  that  updates 
the  third  replica. 

Figure  4-1  shows  the  action  tree1  of  an  action  A  that  tries  to  increment  the  replicated 
counter  C  twice.  For  each  increment  operation  A  creates  a  subaction  that  reads  the  counter, 
calculates  the  new  value,  writes  the  new  value  to  two  replicas  and  leaves  a  DA  behind  to  write 
to  the  third  replica  after  the  subaction  commits. 

Note  that  the  ordering  relation  between  the  two  increment  subactions,  A.  1  and  A. 2,  was 

1Sites,  call-actions,  etc.,  are  ignored  for  simplicity. 
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Incr(C ) 


Incr(C) 


Figure  4-1:  An  action  tree  for  the  action  A  that  updated  the  counter  C  twice. 

not  specified.  The  non-serializable  scenario,  describe  below,  can  take  place  when  the  two 
subactions  are  either  concurrent  or  serial  siblings.  The  relationship  between  the  siblings  becomes 
important,  however,  for  the  design  of  the  solution. 

The  non-serializable  execution  is  shown  in  the  time-space  diagram  in  Figure  4-2.  The  thick 
line  is  the  thread  of  execution,  with  the  time  passing  from  left  to  right.  The  dashed  line  marks 
execution  of  DAs.  The  vertical  axis  shows  separate  locations  (i.e.,  actions).  Read  and  write 
operations  are  given  above  the  thread  line.  The  states  of  the  counter  C  are  shown  on  the 
bottom  of  the  figure. 

The  execution  is  as  follows:  The  counter’s  value  is  (6, 6, 6)  when  A  begins  and  creates  A. I2. 
A.l  reads  the  value  6  from  the  counter,  writes  7  to  leplicas  C\  and  C2,  creates  a  DA  (A.l.Dl) 
to  write  7  to  C3  and  commits  to  A.  Next  A. 2  reads  the  value  7  from  the  counter,  writes  8  to 
replicas  C2  and  C3,  creates  a  DA  {A.2.DI)  to  write  8  to  C\  and  commits  to  A.  Next  the  latter 
DA,  A.2.DI,  aborts  and  does  not  affect  C\.  The  first  DA,  A.l.Dl,  runs  in  parallel  with  the  rest 
of  the  transaction  and  manages  to  write  7  to  replica  C3  and  commit  just  before  A  commits. 

2Though  both  the  thread  graph  and  the  AIDs  show  A.l  and  A. 2  as  serial  siblings,  the  same  execution  can 
take  place  with  the  two  running  concurrently. 
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Figure  4-2:  The  thread  of  execution  of  the  action  A  with  disconnected  actions  that  led  to  an 
inconsistent  state  of  the  counter  C 

The  result  is  a  non-serializable  execution!  Starting  with  the  consistent  state  C  =  (6, 6, 6)  , 
the  action  A  performed  successfully  two  increment  operations  and  yielded  an  inconsistent  state. 
Actions  following  A  may  read  the  counter  as  either  (7,?,  7)  =7  or  (7,8,?)  =  8  .  The 
current  lock  inheritance  and  acquisition  rules  were  not  violated:  A  inherited  the  locks  of  its 
committing  descendants,  and  each  of  them  acquired  locks  that  were  either  free  or  held  by  A; 
yet  the  execution  of  the  two  subactions  is  not  serializable  because  no  order  exists  such  that 
performing  the  two  subactions  serially  in  that  order  would  produce  the  same  effects. 

The  problem  was  caused  by  the  Write{Cz  <—  7)  operation  of  the  disconnected  action 
A.l.Dl.  Because  it  wrote  to  the  replica  Cz  after  A.2  has  also  written  to  it,  A.1.D1  and  all 
its  ancestors  up  to  the  least  common  ancestor  A  (i.e.,  A.l)  ought  to  come  after  A.2  in  any 
equivalent  serial  execution,  while  the  writing  order  to  Cz  in  our  example  forces  the  opposite 
order.  The  DA  A.1.D1  violated  the  basic  rule  of  two-phase  locking;  it  acquired  a  lock  (on  C3 ) 
after  its  ancestor  begun  releasing  locks  (on  C2). 

The  problem  could  have  been  solved  if  knowledge  about  the  serialization  order  existed  at 
the  site  of  C3  to  prevent  A.l.Dl  from  writing  to  Cz  after  A. 2  did  so.  In  this  particular  example 
such  knowledge  did  exist,  but  at  the  application  level,  in  the  form  of  replica  version  numbers; 
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replica  C3  with  value  8  has  a  higher  version  number  than  the  one  to  be  written  by  .4.1. PI.  We 
would  like  to  have  a  general  mechanism  to  provide  similar  knowledge  at  the  system  level. 


4.2  Using  Timestamps  to  Serialize  Serial  DAs 

The  serialization  problem  can  be  solved  for  lock  conflicts  between  serially  related  (disconnected) 
actions  with  the  use  of  timestamps.  The  timestamp  mechanism  proposed  below  enforces  cre¬ 
ation  order  on  DAs;  therefore  it  can  have  no  effect  when  the  conflict  is  between  concurrent 
(disconnected)  relatives.  Section  4.3  proposes  another  mechanism,  constraint  tables,  that  han¬ 
dles  concurrent  DAs.  Though  constraint  tables  can  be  extended  to  handle  conflicts  between 
serial  relatives  as  well,  their  implementation  is  expensive.  The  timestamp  mechanism  is  very 
efficient  (e.g.,  no  extra  messages  needed)  and  handles  most  of  the  lock  conflicts;  therefore  both 
mechanisms  are  implemented,  and  the  decision  about  granting  a  lock  to  a  disconnected  action 
is  made  based  on  the  ordering  relation  between  that  DA  and  the  owner  of  the  conflicting  lock. 

The  timestamp  mechanism  proposed  here  enables  a  disconnected  action  P  to  tell  that  an 
object  0  was  used  by  some  future  action  (e.g.,  later  serial  sibling),  thus  preventing  P  from 
acquiring  a  conflicting  lock  on  0.  The  use  of  timestamps  divides  the  execution  into  intervals  of 
pseudo-time;  a  new  interval  begins  immediately  after  an  action  A  creates  a  (top)  disconnected 
action  P  (or  a  concurrent  set  of  DAs),  before  A  continues.  Everything  that  A  and  its  ancestors 
do  after  P’s  creation  belong  to  later  intervals  of  pseudo-time,  and  P  can  only  observe  effects 
that  were  created  in  its  interval  or  previous  ones,  and  change  them  if  no  other  later  action  has 
yet  observed  these  effects. 

The  implementation  maintains  a  timestamp  with  every  action  and  every  lock.  Actions’ 
timestamps  are  increased  monotonically  during  the  execution  of  the  transaction;  locks’  times¬ 
tamps  are  updated  when  actions  access  objects.  Thus  enough  information  exists  at  a  site  to 
decide  whether  a  DA  requesting  a  lock  can  get  it  or  must  abort. 

The  use  of  timestamps  can  solve  the  serialization  problem  of  Section  4.1  for  serial  siblings. 
Figure  4-3  shows  again  the  action  tree  for  A  that  created  serial  subactions.  The  actions’ 
timestamps  are  shown  (boxed)  for  an  execution  that  traverses  the  tree  in  a  depth-first,  left 
to  right  manner.  A  starts  with  a  zero  timestamp  and  creates  A.  1  with  the  same  timestamp. 
A.l  creates  a  DA  A.l.Dl,  which  is  also  given  its  parents’  timestamp,  and  A.l’s  timestamp  is 
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0  HI 

Figure  4-3:  An  action  tree,  with  timestamps,  for  the  action  A  that  updated  the  counter  C  twice 
using  serial  subactions. 


Figure  4-4:  The  thread  of  execution  of  the  action  A  with  disconnected  actions  that  uses  times¬ 
tamps  to  maintain  a  consistent  state  of  the  counter  C 
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incremented  to  1  before  it  continues.  When  >1.1  commits  to  A,  A’s  timestamp  is  updated  to 
match  A.l’s,  and  a  similar  scenario  takes  place  for  A. 2,  which  commits  to  A  with  a  timestamp 
of  2. 

Figure  4-4  shows  the  thread  of  execution  and  the  states  of  the  object,  with  their  correspond¬ 
ing  timestamps  (*  marks  an  initial  value).  As  seen,  every  modified  replica  has  the  timestamp 
that  the  action  that  modified  it  had  at  the  time  of  the  modification.  When  the  DA  A.l.Dl 
tries  to  modify  C3,  the  timestamp  there  has  the  value  1  reflecting  A.2’s  access  to  C3.  Since 
A.I.jDI’s  timestamp  is  only  0,  it  should  not  be  allowed  to  read  or  write  C3,  which  has  a  later 
value,  and  must  be  aborted. 

This  rest  of  this  section  describes  first  how  new  locking  rules  must  be  added  to  the  cur¬ 
rent  ones  to  ensure  serialization  of  serial  DAs  by  using  the  timestamp  mechanism.  Next  the 
implementation  is  described,  followed  by  extension  of  the  mechanism  for  nested  DAs. 

4.2.1  Additional  locking  rules 

The  new  additional  lock  acquisition  rules  are  summarized  in  Figure  4-5;  they  apply  only  if  the 
action  D  that  requests  a  lock  is  disconnected.  The  new  rules  prevent  D  from  observing  or 
affecting  a  state  that  was  created  by  a  later  ordered  (serial)  relative.  Failure  to  satisfy  the  new 
rules  implies  that  D  should  be  aborted.  Note  that 


•  Acquiring  a  write  lock 

—  D' s  timestamp  must  be  no  smaller  than  the  timestamp  of  any  lock  holder 
on  X. 

•  Acquiring  a  read  lock 

—  D' s  timestamp  must  be  no  smaller  than  the  timestamp  of  any  write  lock 
holder  on  X. 


Figure  4-5:  Additional  lock  acquisition  rules  for  a  DA  D  on  an  object  X 

The  new  rules  are  applied  only  after  the  current  rules  approved  granting  a  lock  to  a  DA 
D ;  this  way  the  new  rules  are  applied  only  to  conflicting  locks  held  by  ancestors  of  D.  Note 
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that  unnecessary  aborts  of  DAs  are  prevented;  a  DA  requesting  a  lock  on  an  object  need  not 
abort  when  a  later  sibling  is  found  to  hold  a  conflicting  lock  because  timestamps  are  checked 
only  after  that  sibling  commits  to  a  common  ancestor  or  aborts.  For  example,  the  DA  A.1.D1 
(in  Figure  4-1)  need  not  be  aborted  if  A. 2,  which  has  a  higher  (i.e.,  later)  timestamp,  is  holding 
a  lock  on  C$  at  the  time  A.1.D1  requested  a  lock.  It  is  possible  that  A. 2  would  later  abort, 
its  locks  would  be  discarded  and  A.1.D1  would  be  able  to  acquire  a  lock  on  C3.  A.l.Dl  has  to 
wait  until  A. 2  either  commits  to  A  or  aborts. 


checkJocks-for_read  =  proc(Reader:ainfo,  OBJ:object)  returns(reply)  signals(abort_DA(aid)) 

%  Called  at  a  site  to  check  whether  Reader  can  get  a  READ  lock  on  OBJ 
if  versions-Stack$non.empty(OBJ.write.stack)  then 

top.writeJock :  lock  :=  versions_stack$topJock(OBJ.write_stack) 
if  aid$non_descendant(Reader.aid,top_writeJock.aid)  then 

return(can.not-haveJock(Reader,OBJ,top.write.lock.aid)) 
j  elseif  aid$disconnected(Reader.aid)  cand  Reader.ts  <  top_write.lock.ts  then 

f  signal  abort-DA(Reader.aid) 

end 
end 

%  Reader  is  a  descendant  of  the  top  write  locker  (if  any) 
if  lock_set$member(Reader.aid,OBJ. read-set)  then 
return(already_hasJock(Reader,OBJ)) 

else 

return(can-haveJock(Reader,OBJ)) 

end 

end  check.locks_for_read 


Figure  4-6:  The  lock  acquisition  rules  and  the  additional  rules  applied  for  a  (possibly  discon¬ 
nected)  reader 

Figure  4-6  and  Figure  4-7  extend  the  algorithms  presented  in  Figure  2-1  and  Figure  2-2  to 
apply  the  new  additional  rules  as  well.  The  added  lines  are  marked  with  a  “f”.  Every  conflicting 
lock  is  first  checked  by  the  current  rules,  and  if  it  does  not  conflict  with  the  request,  then  by 
the  new  ones.  Note  that  only  the  top  write  lock  is  checked;  due  to  the  ancestry  order  of  the 
write  lock  stack,  all  those  below  the  top  belong  to  ancestors,  whose  timestamps  are  no  greater 
than  the  one  of  the  top  lock. 
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checkJocks.for.write  =  proc(Writer:ainfo,  OBJ:object)  returns(reply)  signals(abort.DA(aid)) 

%  Called  at  a  site  to  check  whether  Writer  can  get  a  WRITE  lock  on  OBJ 
for  readJock:lock  in  lock.set$e!ements(OBJ.read_set)  do 

if  aid$non.d!5Scendani(Writer.aid,readJock.aid)  then 

return(can.not-have.lock(Writer,OBJ,readJock.aid)) 
f  elseif  aid$disconnected(Writer.aid)  cand  Writer.ts  <  readJock.ts  then 

f  signal  abort.DA(  Writer,  aid) 

end 
end 

if  versions-stack$non.empty(OBJ.write-stack)  then 

top.writeJock :  lock  :=  versions-stack$topJock(OBJ.write_stack) 
if  aid$non.descendant{Wr:ter.aid,top.writeJock.aid)  then 

return(can.noUiaveJock{Writer,OBJ,top.writeJock.aid)) 
elseif  Writer.aid  =  top.write_lock.aid  then 
return(already_hasJock(Writer,OBJ)) 

f  elseif  aid$disconnected(Writer.aid)  cand  Writer.ts  <  top.writeJock.ts  then 

f  signal  abort-DA(Writer.aid) 

end 

end 

%  Writer  is  a  descendant  of  the  top  write  locker  (if  any) 

return(can.haveJock(Writer,OBJ)) 

end  checkJocks-for.write 


Figure  4-7:  The  lock  acquisition  rules  and  the  additional  rules  applied  for  a  (possibly  discon¬ 
nected)  writer 

4.2.2  Implementation  of  the  Timestamp  Mechanism 

Timestamps  are  implemented  as  part  of  the  local  action  (i.e.,  AINFO.ts)  and  as  part  of  every 
lock.  A  possible  implementation  of  timestamps  is  given  in  Appendix  B.  Messages  sent  for 
creating  a  handler  call  or  committing  it  need  also  to  carry  the  timestamp  of  the  action  (the  one 
on  the  sending  side). 

Management  of  actions’  timestamps 

Figure  4-8  outlines  the  way  our  model  is  extended  to  maintain  timestamps  on  actions.  When 
any  action  is  created,  it  inherits  its  parent’s  timestamp  or  zero  if  it  is  a  topaction.  The  action’s 
timestamp  is  incremented  immediately  after  creating  a  (top)  DA  or  a  concurrent  set  that  in¬ 
cludes  (top)  DAs,  hence  giving  an  active  action  a  timestamp  higher  than  that  of  its  disconnected 


47 


Figure  4-8:  Timestamp  management  rules  for  any  action  A 

Note  how  the  case  of  concurrent  siblings  that  created  DAs  is  treated:  The  parent  action 
(concurrent-set  action),  which  is  blocked  until  all  its  concurrent  subactions  terminate,  gets  the 
maximal  timestamp  from  the  committed  subactions  before  it  continues.  This  way  the  parent’s 
timestamp  will  be  larger  than  that  of  any  DA  that  was  created  by  its  concurrent  subactions. 

Management  of  timestamps  on  locks 

The  rule  for  managing  timestamps  on  locks  is  simple:  Whenever  an  action  A  (whose  current 
timestamp  is  A.ts )  accesses  an  object  0,  on  which  A  has  the  appropriate  lock  L,  do: 

L.ts  :=  A.ts 

Note  that  the  rule  applies  in  two  different  cases:  When  A  first  acquires  the  lock  L,  and  in 
each  of  A’s  subsequent  accesses  to  03.  The  rule  has  to  apply  to  actions  that  have  already  gotten 

3This  is  the  reason  for  making  the  distinction  between  “can.have.lock”  and  the  reply  “already .hasJock”  in 
our  code. 


the  lock  because  actions’  timestamps  can  increase  during  their  lifetime.  The  cost  of  updating 
timestamps  on  every  access  is  negligible  since  the  appropriate  lock  is  checked  on  every  access 
anyhow  (see  the  requirements  on  page  21). 

Separation  of  read  and  write  accesses,  and  the  requirement  to  have  the  appropriate  lock 
for  each4,  prevent  some  unnecessary  aborts  of  DAs.  For  example,  suppose  some  action  A 
with  timestamp  1  writes  to  an  object  0,  and  later  reads  0  when  its  timestamp  is  3.  A’s 
disconnected  child  A. D5  with  a  timestamp  2  should  be  able  to  read  0  at  this  point,  but  not 
modify  it.  Unless  A  has  separate  locks,  one  for  each  operation,  A.D 5  may  be  denied  a  read-lock 
on  0  if  the  timestamp  on  A’s  write  lock  reflects  A’s  read  access  (with  timestamp  3). 


inherit.read.lock  =  proc  (Ancestor,  Locker:  aid,  OBJ:object) 

%  Requires:  Locker  descendant  of  Ancestor,  Locker  has  a  read-lock  on  OBJ 
%  Modifies:  OBJ.readjset  and  Locker's  read  lock  there 
%  Effects:  Ancestor  inherits  Locker's  read-lock  on  OBJ 
locker-lock :  lock  :=  lock-set$find(Locker,OBJ.read.set) 
lockerJock.aid  :=  Ancestor 

ancestorJock :  lock  :=  lock_set$find(Ancestor,OBJ.read.set) 
except  when  not-found:  return  end 
locker.lock.ts  :=  timestamp$max(ancestorJock.ts,lockerJock.ts) 
lock_set$remove(Ancestor,OBJ.read.set) 
end  inherit-readJock 


Figure  4-9:  Ancestor  inherits  Locker’s  read-lock  on  OBJ 

When  the  locks  of  an  action  are  inherited  by  an  ancestor,  the  timestamps  on  the  locks  are 
not  affected  by  the  ancestor’s  current  Timestamp  (i.e.,  AINFO.ts).  The  timestamp  is  a  part  of 
the  lock  (like  the  version),  and  only  the  lock’s  owner  is  replaced.  Figure  4-9  describes  how  a 
read-lock  is  inherited  by  an  ancestor.  Basically  the  lock  owner  is  replaced,  and  if  the  ancestor 
already  has  a  read-lock,  it  is  discarded.  The  maximal  timestamp  of  the  two  locks  prevails;  this 
way  the  latest  read  access  on  behalf  of  the  ancestor  Is  reflected. 

Figure  4-10  describes  how  a  write-lock  is  inherited  by  an  ancestor.  The  inherited  lock,  with 
its  version  and  timestamp,  replaces  all  the  locks  below  in  the  stack  up  to  (and  including)  the 
ancestor’s  lock.  Note  that  the  timestamp  on  the  write-lock  reflects  the  time  its  version  was  last 
modified,  regardless  of  the  ancestor’s  (possibly  higher)  timestamp  at  the  time  of  the  inheritance. 

4I.e.,  a  read-lock  is  needed  before  reading  even  if  the  action  already  has  a  write-lock  there,  see  page  21. 
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inherit-writeJock  =  proc  (Ancestor,  Locker:  aid,  OBJrobject) 

%  Requires:  Locker  descendant  of  Ancestor. 

%  Locker  has  the  top  write-lock  in  OBJ.  write^stack 

%  Modifies:  OBJ.  write jstack 

%  Effects:  Ancestor  inherits  Locker's  write-lock  (with  version)  on  OBJ 
lockerJock :  lock  :=  versions-stack$top(OBJ.write_stack) 
top.aid  :  aid  :=  lockerJock.  aid 
while  aid$descendant(top.aid, Ancestor)  do 
versions-stack$pop(OBJ.write-stack) 
top-aid  :=  versions_stack$top(OBJ.write.stack).aid 
end 

lockerJock.aid  :=  Ancestor 
versions-stack$push(OBJ.write.stack,  lockerJock) 
end  inherit-writeJock 


Figure  4-10:  Ancestor  inherits  Locker’s  write-lock  on  OBJ 


4.2.3  Timestamps  for  Nested  DAs 


Our  timestamp  mechanism  can  be  extended  to  handle  the  case  of  more  than  one  level  of 
disconnection  (i.e.  nested  DAs).  Such  a  case  is  shown  in  Figure  4-11;  the  disconnected  action 
A.2.D2  created  another  DA,  A.2.D2.D1.  The  difference  between  this  case  and  the  case  of  a 
single  disconnection  level  is  that  a  nested  DA  like  A.2.D2.D1  has  to  be  synchronized  with  an 
ancestor  that  is  itself  disconnected.  A  timestamp  represented  as  a  single  numerator  does  not 
suffice  here,  and  has  to  be  extended  to  a  multi-part  timestamp.  The  new  lock  acquisition  rules, 
on  the  other  hand,  are  basically  the  same. 

The  extended  timestamp  mechanism  employs  a  separate  timestamp  for  every  level  of  dis¬ 
connection.  A  disconnected  action  D  at  some  level  k  of  disconnection  (i.e.,  there  are  k  discon¬ 
nections  on  the  tree  path  between  D  and  the  root  topaction)  keeps  a  list  of  k  +  1  timestamps. 
For  example,  when  the  DA  A.2.D2  (in  Figure  4-11)  is  created,  it  gets  its  parent’s  timestamp 
of  1  with  a  new  level  (initially  0)  added,  represented  as  1.0  .  Two  (Multi-level)  timestamps  are 
compared  by  matching  them  level  by  level,  left  to  right;  the  first  level  with  non-equal  values  de¬ 


termines  the  order  (a  short  timestamp  is  padded  with  zeroes  on  its  right  side).  (Disconnected) 
actions  update  only  the  right  most  part  of  their  timestamp  when  they  create  DAs.  For  exam 
pie,  A.2.D2's,  timestamp  is  changed  from  1.0  to  1.1  after  creating  a  DA.  The  additional  part  in 
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Figure  4-11:  An  example  for  serializing  nested  DAs  using  timestamps 
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the  timestamp  has  no  effect  on  the  way  committing  children  update  their  parents’  timestamps 
because  the  difference  in  the  number  of  parts  occurs  only  between  parents  and  top  DA  children, 
who  do  not  commit  normally  anyhow. 

The  timestamp  rules  and  implementation  presented  in  this  chapter  apply  to  nested  DAs 
as  well.  The  multi-part  details  are  encapsulated  inside  the  abstraction  for  timestamps  (see 
the  implementation  in  Appendix  B).  Timestamp  operations  like  increment,  max,  equal,  etc., 
handle  the  appropriate  parts  of  the  representation  internally  as  needed. 

4.3  Concurrent  DAs 

Disconnected  actions  that  were  created  by  concurrent  subactions  (concurrent  DAs)  pose  prob¬ 
lems  that  can  not  be  solved  by  a  mechanism  like  timestamps.  The  timestamp  mechanism 
proposed  above  enforces  the  static  creation  order  of  the  serial  subactions  upon  their  discon¬ 
nected  actions.  Concurrent  subactions,  on  the  other  hand,  are  serialized  dynamically.  They  are 
created  simultaneously,  run  concurrently  and  should  behave  as  though  they  were  run  in  some 
serial  order.  This  order  reflects  the  order  in  which  the  subactions  accessed  shared  atomic  data 
and  is  determined  dynamically  as  the  actions  run. 

Execution  of  an  action  that  uses  concurrent  DAs  may  result  in  an  inconsistent  state.  The 
scenario  described  in  Section  4.1  can  take  place  when  the  two  subactions,  A.  1  and  A. 2,  are 
created  simultaneously  as  concurrent  siblings  (and  have  initially  the  same  timestamp).  The 
locking  rules  presented  above  do  not  prevent  the  non-serializable  execution;  A.  1  updates  C\ 
and  C2,  commits  and  A  inherits  its  locks,  and  A. 2  acquires  the  locks  held  by  A,  does  its  work 
and  commits  to  A.  The  DA  A.1.D1  can  acquire  the  lock  on  C3  because  A  inherited  it  from 
A. 2  .  Timestamps  are  not  sufficient  in  this  case,  and  no  inlr  •  .ation  exists  at  Cs's  site  (assume 
that  the  other  actions  run  at  other  sites)  to  tell  whether  A.l.Dl  can  get  a  lock  on  C3  or  not. 
By  acquiring  the  lock  on  C3,  A.  1  .D 1  serializes  its  concurrent  parent  A.  1  after  its  sibling  .4.2 
that  affected  C3  before,  resulting  in  a  non-serializable  execution  of  A.  1  and  A. 2  since  Ci  was 
accessed  in  the  opposite  order.  Note  that  the  serialization  order  is  determined  by  the  order  in 
which  locks  were  passed  from  one  relative  (A.l)  to  another  (A. 2)  via  the  LCA  (.4);  our  solution 
takes  advantage  of  this  fact  to  monitor  the  serialization  of  concurrent  subactions  serially  at  the 
site  of  the  LCA. 
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In  a  model  without  DAs,  the  commit  order  of  concurrent  subactions  is  a  possible  serial  order 
of  the  subactions;  the  exact  serialization  order  is  usually  only  a  partial  order  consistent  with  the 
commit  order.  When  DAs  are  used,  the  actual  commit  order  of  the  concurrent  subactions  may 
not  be  consistent  with  the  way  they  get  serialized;  a  disconnected  descendant  of  a  concurrent 
subaction  S  may  be  active  after  S's  commit  and  force  serialization  constraints  on  S  by  acquiring 
locks  on  objects  that  were  accessed  by  other  concurrent  siblings  of  S. 

This  chapter  proposes  a  new  mechanism  that  uses  Constraint  Tables  and  Fingerprint  Locks 
to  control  access  of  disconnected  actions  to  objects  locked  in  conflicting  mode  by  a  concurrent 
relative.  The  new  mechanism  monitors  the  serialization  order  of  the  concurrent  subaction, 
and  keeps  an  explicit  record  of  it  (as  a  Constraint  Table);  this  record  is  consulted  as  needed 
whenever  a  DA  tries  to  acquire  a  lock. 

The  idea  behind  the  new  mechanism  is  to  have  centralized  control  over  the  serialization 
of  concurrent  subactions.  Our  problem  is  analogous  to  the  deadlock  detection  problem  in 
distributed  systems.  A  deadlock  implies  a  cycle  of  blocked  concurrent  activities,  each  waiting 
for  next  one;  a  non-serializable  execution  of  concurrent  atomic  actions  can  happen  if,  by 
creating  serialization  constraints  (by  accessing  shared  objects  and  using  DAs),  a  cycle  of  actions 
is  created  where  each  action  is  serialized  after  the  next.  Our  solution  is  similar  in  principle  to 
the  deadlock  detection  algorithm  that  uses  a  centralized  wait-for  graph  ([Gray  1978])  to  monitor 
wait-for  relations  and  prevent  a  cycle. 

The  new  mechanism  does  have  a  significant  performance  cost,  but  it  is  a  very  permissive 
mechanism  that  ensures  serial  behavior.  Other  mechanisms  can  be  devised  that  have  a  lower 
cost,  but  they  may  abort  many  DAs  unnecessarily.  For  example,  another  mechanism  can  be 
designed  to  totally  prevent  DAs  from  acquiring  conflicting  locks  on  objects  that  were  used  by 
concurrent  relatives;  such  a  mechanism  need  not  track  serialization  order,  but  is  too  restrictive. 

The  rest  of  this  section  is  organized  as  follows:  We  begin  the  description  of  the  new  mech¬ 
anism  by  presenting  a  simplified  version  of  our  solution,  and  continue  by  describing  how  we 
make  the  simplified  solution  practical.  Next  we  give  details  of  the  way  our  solution  can  be 
implemented,  and  end  with  discussion  of  several  .osues  that  were  ignored  in  our  solution. 
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4.3.1  A  simplified  solution 

The  simplified  solution,  presented  below,  to  the  problem  of  serializing  concurrent  subactions 
that  create  DAs,  ignores  costs  (i.e.,  extra  messages,  time  and  space)  and  failures  (i.e.,  crashes 
and  aborts);  it  focuses  on  the  basic  method  to  monitor  serialization  of  concurrent  subactions 
and  control  access  of  concurrent  DAs  to  objects. 

Sites  maintain  a  Constraint  Table  ( CT)  for  every  concurrent-set  action  created  locally.  The 
CT  can  be  represented  by  an  nxn  matrix,  where  n  is  the  total  number  of  concurrent  subactions 
in  the  set,  and  an  entry  CT(i,j)  describes  the  serialization  order  between  subactions  i  and  j 
(“before”,  “after”  or  has  some  initial  value).  The  CT  represents  a  directed  graph  in  which 
each  vertex  is  a  concurrent  subaction,  and  each  edge  is  a  serialization  constraint.  A  cycle  in 
the  constraint  graph  is  equivalent  to  a  non-serializable  execution  of  the  concurrent  subactions 
([Eswaran,  et  al.  1976]). 

We  assume  that  every  atomic  object  remembers  its  history.  For  example,  an  object  0  that 
was  modified  by  a  subaction  Si  remembers  Si,  enabling  O’s  site  to  know  that  a  subsequent 
access  of  a  subaction  S2  to  O  (in  conflicting  mode)  implies  the  serialization  constraint  “S2 
ordered  after  Si”. 

Whenever  any  subaction  S2  tries  to  acquire  a  lock  on  an  object  that  was  previously  accessed 
by  a  (committed)  concurrent  relative  Si  (in  conflicting  mode),  a  special  query  message,  CT- 
query,  has  to  be  sent  from  the  object’s  site  (Go)  to  Glca:  the  site  of  (the  concurrent-set  action) 
XCA(Si,S2),  specifying  the  possible  constraint  “S2  ordered  after  Si”.  When  Glca  receives 
the  CTquery,  it  adds  the  serialization  constraint  to  the  CT  of  LCA(S\,Sz)  and  acknowledges 
with  a  CTreply  message.  Hence  the  CT  maintains  explicitly  the  partial  serialization  order  of 
its  concurrent  subactions. 

A  CTquery  un  behalf  of  a  regular  subaction  S2  is  only  used  for  monitoring  serialization;  for 
a  DA  D  the  CTquery  message  and  its  acknowledgment  are  also  used  to  control  the  DA’s  access 
to  objects.  When  an  access  by  D  is  found  to  create  a  cycle  in  the  constraint  graph,  the  CT’s 
site  does  not  update  the  CT  and  uses  the  CTreply  message  to  inform  the  D' s  site  that  access 
should  be  denied  (i.e.,  “abort  D"),  otherwise  the  CT  is  updated  and  the  CTreply  approves  of 
D’s  access. 

Figure  4-12  illustrates  the  simplified  solution.  When  a  subaction  S  tries  to  access  an  object 
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CTqueryf  A.Cl.j  =>  A.Cl.i ) 


Figure  4-12: 
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0  at  site  Go,  0  if  found  to  have  been  used  before  by  a  concurrent  relative  of  S.  A  CTquery 
message  is  sent  from  Go  to  the  site  Ga  of  LCA(S,A,Cl.i),  describing  the  new  constraint 
“A.Cl.j  must  be  serialized  after  A.Cl.i”.  The  constraint  table  at  Ga  is  updated  and  a  CTreply 
is  returned.  If  S  had  been  a  disconnected  action,  then  the  CT  would  be  checked  first,  and  the 
CTreply  would  return  the  decision. 

4.3.2  Making  the  solution  practical 

The  simplified  solution  proposed  above  works,  but  severed  efficiency  issues  must  be  dealt  with 
to  make  this  solution  practiced.  In  the  following  we  discuss  the  major  issues:  How  to  reduce 
the  number  of  messages,  how  to  avoid  the  overhead  when  DAs  are  not  used  and  how  to  ensure 
that  objects  would  have  their  relevant  access  history  available  when  locks  are  acquired. 

Reducing  the  number  of  messages 

Our  simplified  mechanism  requires  exchange  of  messages  for  every  access  of  an  action  to  an 
object  that  was  used  by  a  concurrent  relative.  We  propose  here  to  reduce  the  number  of 
messages  by  caching  previous  replies  locally  and  using  other  existing  messages  to  carry  some  of 
the  data  needed  by  our  mechanism. 

The  use  of  CTquery  messages  is  parallel  in  many  cases  to  the  use  of  regular  query  messages 
(to  the  LCA).  Basically  all  that  is  needed  to  make  a  regular  query  message  into  a  CTquery 
message  is  the  name  (AID)  of  .he  action  that  tries  to  access  the  object.  Our  solution  unifies  the 
two  kinds  into  one.  Though  this  conversion  of  regular  queries  (and  replies)  seems  to  come  with 
no  cost,  we  do  lose  a  possible  optimization:  Regular  queries  are  not  sent  on  behalf  of  specific 
actions  accessing  specific  objects;  therefore  one  query  may  be  sent  on  behalf  of  several  actions 
trying  to  use  several  objects.  We  can  still  use  this  optimization  when  a  CTque *  y  is  not  needed. 

Some  CTqueries  can  not  “catch  a  ride”  on  existing  queries.  In  some  cases  a  regular  query  is 
not  needed,  but  giving  a  lock  still  implies  the  creation  of  serialization  constraint  (see  Fingerprint 
Locks  below).  We  send  more  (modified  regular)  queries  in  these  cases,  but  may  refer  to  them 
as  CTqueries  to  stress  their  purpose. 

Local  caching  can  help  avoid  some  messages.  The  idea  is:  The  site  replying  with  CTreply  will 
add  the  relevant  CT  to  the  reply,  and  the  site  receiving  the  reply  will  cache  that  CT  locally, 
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possibly  replacing  an  older  copy.  This  way,  if  the  local  copy  is  consulted  first,  the  message 
exchange  can  be  avoided  when  the  order  is  known  locally. 

Enable  the  mechanism  only  when  DAs  are  used 

As  stated  before,  one  of  our  design  goals  is  to  have  the  new  model  perform  similarly  to  the 
current  one  when  DAs  are  not  in  use.  Having  a  constraint  table  for  every  concurrent  set  of 
subactions  and  sending  CT  messages  does  not  contribute  to  this  goal.  We  would  like  the  CT 
mechanism  to  be  triggered  only  when  it  is  needed,  i.e.,  after  some  concurrent  subaction  created 
DAs.  The  problem  is  that  the  site  of  the  concurrent-set  action  C  has  no  a  priori  knowledge 
about  C's  intention  to  create  DAs;  we  need  a  way  to  detect  it  dynamically. 

We  choose  the  triggering  event  to  be  the  commit  of  the  first  concurrent  subaction  S  with  a 
timestamp  bigger  than  its  ancestor’s,  the  concurrent-set  action  C.  This  event  implies  at  least 
one  concurrent  subaction  ( S )  created  DAs,  and  our  mechanism  is  needed  (except  when  S  is 
the  last  concurrent  subaction  to  commit  to  C ).  Triggering  the  CT  mechanism  means  not  only 
creating  a  new  CT,  but  also  notifying  sites  that  make  relevant  queries  that  CT  caching  has  to 
be  done  and  that  lock  propagation  must  be  handled  differentiy  (see  below). 

When  S  commits  and  triggers  the  mechanism  for  <7,  a  new  constraint  table  is  allocated 
at  C’s  site  for  C.  This  CT  needs  not  handle  those  concurrent  subactions  of  C  that  already 
committed5,  with  the  exception  of  S ,  which  also  gets  an  entry  in  the  CT. 

Once  the  mechanism  has  been  triggered  (for  C),  the  “committed  to  LCA”  reply  to  every 
query  regarding  C’s  subactions  (replied  from  C’s  site)  must  be  tagged  with  the  tag  make.FP 
and  carry  a  recent  copy  of  the  CT.  A  site  receiving  a  tagged  reply  would  cache  the  CT  copy 
locally  and  handle  lock  propagation  differently  (as  described  below). 

Note  that  C’s  site  may  notice  that  DAs  were  created  on  C’s  behalf  before  S  commits  by 
receiving  a  query  from  (or  about)  some  DA,  descendant  of  C.  The  mechanism  need  not  be 
triggered  at  this  point  because  non-serializable  execution  of  concurrent  subactions  using  DAs 
must  include  a  commit  of  some  subaction  using  DAs6. 

sBecause  they  created  no  DAs. 

sThe  execution  is  serializable  prior  to  tiie  commit  of  the  first  concurrent  subaction  that  created  DAs  because 
no  concurrent  subaction  has  yet  violated  the  rule  of  strict  two-phase  locking;  i.e.,  acquired  a  lock  (by  a  descendant 
DA)  after  releasing  another. 
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Identify  the  last  action(s)  to  access  an  object 

The  simplified  solution  required  that  the  history  of  each  object  be  kept,  so  the  last  (concurrent) 
action  to  access  the  object  in  conflicting  mode  can  be  known.  Our  model  does  keep  this 
information  with  each  object  in  the  form  of  locks,  but  once  locks  are  inherited,  the  identity 
of  the  real  accessor  is  no  longer  known.  More  specifically,  after  a  lock  that  was  held  by  a 
descendant  of  a  concurrent  subaction  S  is  inherited  by  (an  ancestor  of)  the  concurrent-set 
action  C,  it  is  impossible  to  tell  which  of  C’s  concurrent  subactions  had  it.  We  need  a  way  to 
remember  the  last  (concurrent)  subaction  that  modified  some  object,  and  possibly  several  that 
read  the  object  (and  are  serialized  after  that  writer). 

Our  solution  is  to  both  keep  that  inherited  lock  and  inherit  it.  This  can  be  done  as  follows: 
Inherit  locks  as  before,  but  once  a  lock  is  propagated  to  the  level  of  the  concurrent-set  action, 
a  copy  of  the  old  lock  is  kept  as  a  Fingerprint  Lock  (FPL).  This  way,  when  a  subaction  tries  to 
get  a  lock  on  an  object  previously  used  by  some  concurrent  relative,  it  would  be  known  with 
whom  it  conflicts.  The  FPL  is  a  regular  (read  or  write)  lock,  but  is  tagged  with  the  boolean 
flag  fp.  The  FPL  is  transparent  to  operations  that  do  not  need  it  (e.g.,  lock  query  by  a  later 
serial  relative  or  commit  of  the  transaction7),  and  is  used  only  by  our  mechanism. 

Our  solution  actually  requires  a  FPL  to  exist  on  an  object  before  granting  a  conflicting 
lock;  that  is,  first  queries  are  sent,  and  their  replies  cause  such  conflicting  locks  to  propagate 
and  also  become  FPLs,  next  checks  are  done  to  ensure  the  correct  serialization.  This  leads  to  a 
simple  (and  uniform)  additional  lock  granting  rule  (and  implementation):  “The  local  CT  must 
approve  the  ordering  constraint  with  every  conflicting  FPL  before  the  lock  is  granted”. 

We  could  generate  FPLs  whenever  a  lock  is  inherited  (as  explained  above)  and  keep  it 
until  the  transaction  commits,  but  this  may  take  a  performance  toll.  Our  implementation 
begins  generating  FPLs  only  after  the  CT  mechanism  has  been  triggered  (for  the  appropriate 
concurrent  set  of  subactions);  that  is,  FPLs’would  be  created  only  if  the  reply  is  tagged  with 
make-FP.  FPLs  are  also  discarded  once  they  are  no  longer  needed;  we  describe  below  how 
obsolete  write  FPLs  are  discarded,  as  a  part  of  maintaining  the  write  stack,  and  ignore  for  now 
the  need  to  discard  read  FPLs,  as  obsolete  read  FPLs  only  slow  our  algorithm,  but  do  not 

7We  ignore  FPLs  in  the  description  of  the  commit  mechanism  (Chapter  5)  as  they  have  no  effect  there. 
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Figure  4-13:  An  example  of  multiple  levels  of  concurrent  subactions. 


change  its  behavior. 

Note  that  propagation  of  a  lock  from  a  holder  H  to  an  ancestor  A  can  generate  more  than 
one  FPL;  when  there  are  several  concurrent-set  actions  between  H  and  A  (including  A),  the 
“arm”  (concurrent  subaction)  in  each  must  be  remembered.  This  applies  to  both  read  and 
write  locks. 

Such  a  case  is  illustrated  in  Figure  4-13.  The  subaction  A. 1.1.2  created  DAs  and  caused 
A.l.l’s  CT  to  be  triggered.  Cascaded  triggering  occurs  when  A.1.1  commits  to  A.l  and  later 
to  A.  After  that,  if  the  subaction  A.2  tries  to  get  a  (conflicting)  lock  on  some  object  0  that  is 
locked  by  A. 1.1.1,  the  query  sent  to  A’s  site  indicates  that  with  “A.l  committed  up  to  A”,  and 
tagged  with  make.FP. 

Creating  a  FPL  on  0  for  A.l  alone  is  not  enough,  because  A.2  may  abort  after  that  and 
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a  DA  D  created  by  A.  1.1.2  may  try  later  to  get  a  lock,  but  no  trace  of  A.l.l.l’s  access  would 
then  be  found  on  0.  The  solution  is  to  create  a  FPL  for  every  concurrent  subaction  between 
the  lock  holder  and  the  LCA.  In  our  example,  when  the  tagged  reply  arrives,  three  FPLs  are 
needed:  for  A.  1,  A.1.1  and  A. 1.1.1  . 

Once  a  read  FP  lock  is  created,  it  is  put  in  the  unordered  set  of  read-locks.  Multiple  read 
FPLs  (of  the  same  concurrent  level)  may  exist  on  an  object  because  readers  do  not  enforce 
serialization  constraints  among  themselves.  For  simplicity,  our  implementation  assumes  that 
read  FPLs  are  not  discarded  unless  some  ancestor  aborted  or  their  transaction  committed. 

When  a  write  FP  lock  is  created,  it  is  put  on  top  of  the  write  stack;  when  several  FPLs 
are  created,  they  are  left  on  the  stack  in  ancestry  order.  See  Figure  4-14  for  the  state  of  the 
stack  in  the  case  of  the  previous  example.  When  an  action  like  A. 2  acquires  a  write  lock,  it 
will  be  positioned  above  the  FPLs  in  the  stack,  so  if  A. 2  aborts,  its  lock  is  discarded  and  other 
actions  (e.g.,  concurrent  siblings  of  A.  1  and  A. 2)  can  still  use  the  FPLs.  If  A. 2  commits  and 
A  inherits  its  write  lock,  then  the  three  FPLs  (shown  in  Figure  4-14)  become  obsolete,  and  are 
discarded,  but  a  new  FPL  is  created  for  A. 2  .  Note  that  our  implementation  of  “inherit-writeJock" 
(page  50)  already  discards  all  these  locks  since  it  discards  all  write-locks  (of  any  kind)  between 
the  inherited  lock  (e.g.,  A.2’s  lock  on  the  top)  to  the  position  of  the  ancestor  (A). 

In  general,  the  sequence  of  real  locks  on  the  write  stack  will  be  interleaved  with  some 
subsequences  of  FPLs.  The  real  locks  represent  the  current  path  P  in  the  action  tree,  while 
the  subsequences  of  FPLs  represent  previous  paths  branching  away  from  P  that  may  become 
useful  if  actions  with  locks  above  them  abort.  When  the  top  lock  is  a  real  lock,  regular  work  is 
done  with  it,  ignoring  possible  FPLs  below.  But  when  several  actions  abort,  the  FPLs  below 
may  get  exposed. 

4.3.3  Implementation 

Below  we  describe  the  fine  details  needed  to  implement  our  mechanism.  We  assume  that  sites 
keep  CTs  and  manage  them,  updating  copies  as  tagged  replies  arrive  and  discarding  them  when 
the  transaction  terminates8.  Queries  are  assumed  to  contain  the  AID  of  the  action  A  requesting 
the  lock,  and  are  sent  whenever  A  can  not  get  a  lock  due  to  a  conflicting  one. 

®CTs  can  be  kept  in  a  data  structure  similar  to  ACTIVE  or  COMMITTED. 
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Figure  4-14:  The  state  of  the  write  stack  with  FP  locks  (boxed). 

When  a  concurrent  subaction  S  commits  to  its  concurrent-set  action  C  (which  must  happen 
locally),  if  S.ts  is  bigger  than  C.ts  then  a  new  CT  is  created  for  C  locally  (if  not  yet  so  done) 
to  map  all  C's  subactions  that  are  still  active,  including  S.  The  implementation  must  be  able 
to  tell  at  any  time  which  of  C’s  subactions  have  not  yet  terminated. 

When  a  lock  query  arrives  at  C  site,  Gc >  specifying  that  a  descendant  of  a  concurrent 
subaction  S2  requests  a  lock  held  by  S\  (such  that  C  =  LCA(Si,S2)),  then  if  S\  has  committed 
to  C,  and  C  has  a  CT,  then  that  CT  is  updated  if  necessary,  unless  the  update  creates  a  cycle. 
Then  Gc  replies,  tagging  its  “committed  to  C”  reply  with  make.FP  and  adding  a  copy  of  the 
CT  to  it. 

When  a  tagged  reply  arrives,  implying  that  the  locks  of  a  locker  S  are  to  be  propagated 
to  an  ancestor  A,  the  locks  are  propagated  to  A,  but  FPLs  are  created  for  every  concurrent 
subaction  on  the  path  between  S  and  A  ( S  included,  A  excluded).  Figure  4-15  describes  the 
routine  to  be  called  when  a  tagged  reply  arrived  (for  a  non-tagged  reply,  the  one  in  Figure  4-9 
is  used),  and  Figure  4-16  gives  the  routine  for  write  locks  (the  one  in  Figure  4-10  is  used  when 
non-tagged  replies  arrive). 
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make.FP.read  =  proc  (Ancestor,  Locker:  aid,  OBJ:object) 

%  Called  when  the  reply  is  tagged,  instead  of  “inheritjeadJock" 

%  Requires:  Locker  descendant  of  Ancestor,  Locker  has  a  read-lock  on  OBJ 
%  Modifies:  OBJ.readset 

%  Effects:  Ancestor  inherits  Locker's  read-lock  on  OBJ,  and  FPLs  are 
%  created  for  all  the  concurrent  subactions  between  them. 

lockerJock :  lock  :=  lock$copy(  %  inheriLreadJock  modifies  the  original 

lock^et$find(Locker, OBJ. read  jset))  %  return  a  reference  to  that  lock 

inherit.read.lock(Ancestor, Locker, OBJ) 

for  conc_sub:aid  in  aid$concurrent.subactionsJ3etween(Ancestor,Locker)  do 
FPL  :  lock  :=  lock$copy(lockerJock) 

FPL.aid  :=  conc-sub 
FPL.fp  :=  true 

lock.set$insert(FPL,  OBJ.readset) 

end 

end  make.FP.read 


Figure  4-15:  Ancestor  inherits  read-locks  and  creates  read  FPLs. 


make  FP.write  =  proc  (Ancestor,  Locker:  aid,  OBJ:object) 

%  Requires:  Locker  descendant  of  Ancestor. 

%  Locker  has  the  top  write-lock  in  OBJ.  write jstack 

%  Modifies:  OBJ.  write stack 

%  Effects:  Ancestor  inherits  Locker's  write-lock  (with  version)  on  OBJ, 

%  and  FPLs  are  created  for  all  concurrent  subactions  between  them. 

lockerJock :  lock  :=  versionsjstack$top(OBJ.writejstack) 
inherit.’  <iteJock(Ancestor, Locker, OBJ) 

for  conc.sub:aid  in  aid$concurrent.subactions.between(Ancestor, Locker)  do 
FPL :  lock  :=  lock$copy(lockerJock) 

FPL.aid  :=  conc-sub 
FPL.fp  :=  true 

versions.stack$push(OBJ.write_stack,FPL) 

end 

end  make.FP.write 


Figure  4-16:  Ancestor  inherits  write-locks  and  creates  write  FPLs. 


The  algorithm  for  deciding  on  lock  grant  becomes  simple  due  to  the  fingerprint  scheme 
shown  above;  when  an  action  tries  to  acquire  a  lock  on  an  object,  conflicting  locks  of  concurrent 
relatives  are  replaced  (if  our  mechanism  was  triggered)  with  FP  locks,  and  the  serialization 
constraint  can  be  known  locally.  If  by  that  time  relevant  information  has  already  arrived  at  the 
local  CT,  the  lock  granting  process  can  continue,  else  an  extra  query  is  needed. 


serialized-after  =  proc  (S,locker:aid)  returns(bool)  signals(abort.DA(aid)) 

%  Called  when  subaction  S  accesses  an  object  with  an  FP  lock  of  locker. 

%  Makes  a  decision  based  on  the  relevant  CT. 

%  Effects:  Returns  true  if  S  is  serialized  after  locker  in  the  relevant 
%  local  CT  or  if  the  two  are  not  concurrent  relatives. 

%  MEANING:  S  can  ignore  this  FPL 

%  Returns  false  if  no  order  exist  in  the  CT  or  CT  not  found. 

%  MEANING:  A  CTquery  need  to  be  sent  to  find  the  order. 

%  Signals  abort-da  if  S  is  serialized  before  locker  in  the  CT. 

%  MEANING:  S  must  be  a  DA  out  of  serial  order. 

if  aid$not-concurrent.relatives(S, locker)  then  return(true)  end 
relevant-Ct :  ct  :=  find_relevant-ct(aid$LCA(S,locker)) 
except  when  not-found:  return(false)  end 
if  ct$after(S,  locker)  then  return(true) 
else  signal  abort-DA(S) 
end  except  when  no-order:  return(false)  end 
end 

end  serialized-after 


Figure  4-17:  Check  local  Constraint  Table  when  a  lock  check  finds  an  FP  lock 

Figure  4-17  presents  an  abstraction  that  decides  for  an  action  A  finding  a  conflicting  FP  lock 
held  by  B,  based  on  local  CT  information,  which  of  the  three  cases  applies:  “Ignore  this  FPL” 
when  the  ordering  constraint  conforms  to  the  CT  data  (or  when  A  and  B  are  not  concurrently 
related),  “CTquery  is  required”  when  ordering  information  does  not  exist  locally,  and  “ abort 
this  DA”  when  A  is  (a  DA)  out  of  order. 

The  routine  serialized-after  is  used  by  our  lock  checking  procedures,  presented  in  Figures  4- 
18  and  4-19.  These  procedures  do  the  complete  checking  process,  combining  the  current  lock 
acquisition  rules,  the  additional  rules  that  use  timestamps,  and  the  rules  that  use  FPs.  Note 
that  these  are  the  procedures  presented  before  (in  Figures  4-6  and  4-7)  with  additional  lines 
(marked  with  “}”)  that  do  the  work  of  checking  FP  locks. 
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check.locks.for.read  =  proc(Reader:ainfo,  OBJ:object)  returns(reply)  signals(abort.DA(aid)) 

%  Called  at  a  site  to  check  whether  Reader  can  get  a  READ  lock  on  OBJ 
if  versions-stack$non.empty(OBJ.write.stack)  then 

top.writeJock :  lock  :=  versions.stack$topJock(OBJ.write^stack) 
t  while  top.writeJock.fp  do  %  scan  FPLs  top  down 

t  if  serialized.after(Reader.aid.top-writeJock.aid)  then 

|  top.writeJock  :=  top.writeJock.nextJbelow 

l  else  return(can.not.have  Jock(Reader, OBJ,  top-write  Jock.aid)) 

}  end  resigns!  abort.DA 

t  end 

if  aid$non.descendant(Reader.aid,top.writeJock.aid)  then 

return(can_not-haveJock(Reader,OBJ,top.writeJock.aid)) 
f  elseif  aid$disconnected(Reader.aid)  cand  Reader.ts  <  top.write.lock.ts  then 

|  signal  abort.DA(Reader.aid) 

end 

end 

%  Reader  is  a  descendant  of  the  top  write  locker  (if  any) 
if  lock.set$member(Reader.aid,OBJ.read.set)  then 
return(already_hasJock(Reader,OBJ)) 

else 

return(can.haveJock(Reader,OBJ)) 

end 

end  check.locks-for.read 


Figure  4-18:  The  complete  modified  lock  acquisition  rules  applied  when  a  (possibly  discon¬ 
nected)  action  requests  a  read  lock  on  the  object  OBJ. 

Whenever  one  of  the  two  procedures  replies  with  “can jiotJiave Jock”,  a  query  is  sent  to 
the  site  of  the  LCA,  regarding  the  conflict  between  the  action  A  that  tries  to  access  the  object 
and  the  conflicting  lock  owner.  The  reply  to  the  query,  if  tagged  with  make.FP,  calls  the  lock 
propagation  routines  described  above.  When  A  later  tries  again9,  another  query  may  need  to 
be  generated. 

4.3.4  Discussion 

Below  we  discuss  some  of  the  remaining  issues,  most  of  which  were  ignored  previously  to 
maintain  comprehensibility. 

9The  implementation  can  use  either  busy-waiting  or  put  A  in  a  waiting  queue  and  reactivate  it  after  locks 
propagate. 
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checkJocks.for.write  =  proc(Writer:ainfo,  OBJ:object)  returns(reply)  signa!s(abort.DA(aid)) 

%  Called  at  a  site  to  check  whether  Writer  can  get  a  WRITE  lock  on  OBJ 
for  read  Jock:lock  in  lock.set$elements(OBJ. read-set)  do 
if  aid$non.descendant(Writer.aid,readJock.aid)  then 
t  if  read.lock.fp  then 

{  if  serialized.after(Writer.aid,readJock.aid)  then  continue 

}  else  return(can_not.haveJock(Writer,OBJIreadJock.aid)) 

|  end  resignal  abort-DA 

$  else  %  this  lock  is  not  FPL 

return(can.notJiaveJock(Writer,OBJ,readJock.aid)) 

|  end 

f  elseif  aid$disconnected(Writer.aid)  cand  Writer.ts  <  read Jock.ts  then 

f  signal  abort-DA(Writer.aid) 

end 

end 

if  versions-stack$non.empty(OBJ.write-Stack)  then 

top.writeJock :  lock  :=  versions-Stack$topJock(OBJ.write-Stack) 
t  while  top.write.lock.fp  do  %  scan  FPLs  top  down 

|  if  serialized.after(Writer.aid,top.write-lock.aid)  then 

j  top.writeJock  :=  top.write.lock.next  Jselow 

f  else  return(can.not-have.lock(Writer,OBJ,topjwrite.lock.aid)) 

t  end  resignal  abort-DA 

t  end 

if  aid$non.descendant(Writer.aid,top.writeJock.aid)  then 

return(can.notJiave.lock(Writer,OBJ,top.writeJock.aid)) 
elseif  Writer.aid  =  top.write.Iock.aid  then 
return(already.has.lock(Writer,OBJ)) 

|  elseif  aid$disconnected(Writer.aid)  cand  Writer.ts  <  top.writeJock.ts  then 

t  signal  aborLDA(Writer.aid) 

end 

end 

%  Writer  is  a  descendant  of  the  top  write  locker  (if  any) 

return(can.haveJock(Writer,OBJ)) 

end  checkJocks.for.write 


Figure  4-19:  The  complete  modified  lock  acquisition  rules  applied  when  a  (possibly  discon¬ 
nected)  action  requests  a  write  lock  on  the  object  OBJ. 
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Discarding  obsolete  FP  read-locks 

As  described  above,  once  a  FP  read  lock  is  added  to  an  object,  it  remains  there  until  its 
transaction  terminates  (or  some  ancestor  aborts).  Some  FP  read-locks  may  become  obsolete 
later  in  the  execution;  lock  requests  would  not  be  denied  due  to  obsolete  FP  read-locks,  but 
they  would  need  to  be  checked  on  every  write  access  and  also  consume  space. 

A  FP  read-lock  L  becomes  obsolete  if  any  lock  holder  H  (read  or  write),  serialized  after 
L,  commits  to  an  ancestor  of  L.  The  reason  is  simple:  Later  write  accesses  would  have  1 
be  serialized  after  if,  thus  after  V s  holder  anyway.  In  summary,  a  FP  read-lock  L  held  by  a 
subaction  S  on  an  object  0 ,  can  be  discarded  when 

•  Any  write-lock  on  0  is  inherited  by  an  ancestor  of  S. 

•  Another  read-lock  on  0,  held  by  a  later  serial  relative  of  S  is  inherited  by  an  ancestor 
of  S. 

•  S' s  transaction  commits. 

•  Some  ancestor  of  S  aborts. 

Not  delaying  a  regular  subaction  that  created  a  constraint 

In  our  mechanism,  when  a  regular  subaction  S  finds  a  conflicting  FP  lock  held  by  a  concurrent 
relative  R,  and  the  ordering  constraint  “5  after  if”  is  not  found  in  the  local  CT,  a  query 
(actually  CTquery)  has  to  be  sent  to  the  site  of  LCA(S,R)  to  register  that  constraint. 

S  need  not  be  delayed  until  the  CTreply  comes;  S  is  a  regular  action  and  the  current  locking 
rules  already  ensure  its  serialization.  However,  the  following  rare  scenario  may  take  place  if  S 
is  not  delayed:  S  requests  a  lock  that  conflicts  with  a  FP  lock  held  by  the  concurrent  relative 
R  and  a  query  is  sent,  then  S  finishes  quickly  and  commits  to  the  concurrent- set  action,  before 
the  query  message  had  gotten  there.  Next  a  disconnected  descendant  of  R  may  request  one  of 
S' s  locks  and  get  it  because  the  ordering  constraint  has  not  yet  reached  the  CT. 

A  solution  to  the  problem  that  prevents  the  delay  of  S  is  to  piggyback  the  constraints  with 
the  regular  action  that  created  them,  and  check  the  CT  of  the  concurrent-set  action  C  when  S 
commits  up  to  C. 
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Handling  failures 

Our  mechanism  need  not  care  about  site  crashes;  when  a  site  that  was  visited  by  descendants 
of  a  concur  rent- set  action  C  that  committed  to  C  crashes  (before  C’s  transaction  prepares),  C 
becomes  an  orphan.  Subactions  that  abort,  however,  do  pose  a  problem. 

When  some  subaction  .4,  a  proper  descendant  of  a  concurrent  subaction  S  of  C ,  creates  a 
serialization  constraint  in  C’s  CT  (by  requesting  a  lock  someplace)  and  aborts,  its  constraint 
may  become  invalid.  Such  erroneous  constraints  can  cause  unnecessary  aborts  of  DAs  that  try 
to  violate  the  erroneous  order.  Note,  however,  that  an  abort  of  a  concurrent  subaction  like  S 
causes  no  harm;  while  S  was  active,  no  other  sibling  could  have  been  serialized  after  S. 

The  solution  presented  above  ignores  the  possibility  of  an  abort  of  a  subaction  A  that 
created  a  constraint  in  the  CT.  (The  price  to  be  paid  is  potential  unnecessary  aborts  of  DAs 
when  subactions  like  A  abort).  In  the  following  we  outline  a  way  to  handle  aborts;  it  was  not 
implemented  to  avoid  additional  complexities  in  the  presentation  of  the  mechanism. 

The  idea  behind  handling  aborts  is  keeping  an  accurate  record  of  every  serialization  conflict 
created,  and  discarding  this  record  when  the  action  that  created  that  conflict  (or  one  of  its 
ancestors)  abort. 

Our  simplified  mechanism  (without  caching  of  CTs)  can  implement  the  above  idea  by  im¬ 
plementing  the  CT  entries  as  counters.  Every  constraint  created  is  reported  to  the  central  CT 
by  a  CTquery  message;  if  the  CTreply  approves  the  order,  the  appropriate  counter  would  be 
incremented.  The  new  lock  that  is  acquired  after  the  reply  is  received  is  marked  to  refer  to 
that  CT.  When  a  subaction  A  is  aborted,  and  its  locks  are  released,  then  for  any  marked  lock 
A  had,  a  special  message  would  be  sent  to  the  appropriate  CT  and  the  appropriate  counter 
would  be  decremented.  This  way,  when  a  counter  reaches  its  initial  value  again,  the  serialization 
conflict  would  be  gone.  More  work  is  needed,  however,  to  adapt  this  scheme  to  our  optimized 
mechanism  (which  uses  caching,  etc.). 

4.4  Conclusion 

The  lock  acquisition  and  propagation  scheme  presented  above  ensures  serial  behavior  of  nested 
atomic  actions  systems  that  use  disconnected  actions.  The  above  algorithms  were  simulated 
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successfully,  but  no  performance  measurements  were  taken.  The  performance  is  expected  to 
depend  heavily  on  the  various  optimizations,  some  of  which  were  mentioned  above,  which  were 
not  implemented. 

The  above  scheme  acquires  the  requested  lock  by  checking  all  the  conflicts  with  existing 
locks  and  applying  the  serial  or  concurrent  rules,  depends  on  the  ordering  relation  in  each 
conflict.  Our  choice  of  using  two  mechanisms  was  derived  by  considering  costs;  the  timestamps 
mechanism  is  almost  free,  in  comparison  to  the  CT/FP  scheme  that  carries  a  considerable  price. 
Using  the  CT  mechanism  alone  to  handle  both  cases  is  possible  but  too  expensive. 

The  case  of  nested  DAs  was  elaborated  for  the  serial  DAs.  It  should  make  no  difference  for 
concurrent  DAs  because  the  point  is  to  find  the  serialization  constraint  between  the  concurrent 
subactions,  regardless  of  the  level  of  disconnection. 

Note  also  that  the  case  of  concurrent  subactions,  all  or  some  disconnected,  (e.g.,  in  Figure  6- 
1)  is  handled  correctly  by  using  timestamps  alone;  DAs  running  concurrently  still  obey  the  strict 
two-phase  locking  protocol.  However,  if  any  of  the  concurrent  siblings  (DAs)  creates  (nested) 
DAs,  the  CT  mechanism  has  to  be  used. 
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Chapter  5 


Commit  with  Disconnected  Actions. 


This  chapter  discusses  how  a  nested  atomic  action  system  using  disconnected  actions  would 
commit  its  transactions.  Disconnected  actions,  which  were  introduced  into  our  model  to  exploit 
concurrency  and  improve  performance,  violate  some  of  the  preconditions  required  by  the  commit 
mechanism:  Some  (disconnected)  actions  may  still  be  active,  and  information  needed  by  the 
coordinator  of  the  commit  protocol  may  be  scattered  around  the  system. 

This  chapter  describes  a  modified  mechanism  that  ensures  that  a  system  using  DAs  would 
have  the  desired  commit  semantics.  The  new  commit  mechanism  aborts  active  (disconnected) 
actions,  collects  the  missing  information  and  makes  the  effects  of  the  non-aborted  topaction 
and  all  its  non-aborted  descendants  permanent. 

We  shall  start  by  discussing  how  to  modify  the  existing  Two  Phase  Commit  protocol  to 
handle  the  various  situations  that  may  occur  when  disconnected  actions  are  used,  and  examine 
(in  Section  5.2)  in  detail  how  atomic  objects  at  a  site  that  participated  in  the  transaction  are 
to  be  handled  during  the  execution  of  the  commit  protocol. 

5.1  The  modified  Two  Phase  Commit  Protocol 

The  design  of  the  Two  Phase  Commit  protocol,  presented  in  Chapter  2,  assumes  that  the 
site  where  the  topaction  runs  received  all  the  necessary  information  by  the  time  the  topaction 
decided  to  commit.  This  information  includes  the  names  of  all  the  sites  that  participated  in  that 
transaction,  with  whom  the  topaction’s  site  (the  coordinator)  is  to  perform  the  protocol,  and 
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- time 

Figure  5-1:  A  time-space  diagram  of  the  modified  Two  Phase  Commit  protocol. 


the  names  of  all  the  subactions  that  aborted,  required  by  the  participating  sites  for  determining 
the  correct  states  of  their  objects. 

For  a  nested  atomic  action  system  with  disconnected  actions  the  previous  assumption  may 
not  hold.  When  a  topaction  decides  to  commit  it  may  still  have  active  disconnected  descendants 
that  are  visiting  new  sites  or  producing  new  aborted  subactions,  unknown  to  the  topaction  at 
that  time.  Quiescing  all  the  activity  on  behalf  of  the  topaction  before  it  commits  can  solve  this 
problem,  but  the  imminent  cost  of  communication,  roughly  the  same  as  needed  for  the  Two 
Phase  Commit  protocol1,  voids  such  a  solution.  A  better  approach  is  to  quiesce  the  remaining 
active  (disconnected)  actions  and  catch  up  on  the  missing  information  while  carrying  out  the 
Two  Phase  Commit  protocol,  taking  advantage  of  the  existing  message  flow. 

The  current  Two  Phase  Commit  protocol  [Liskov,  et  at.  1987b]  has  to  be  modified  to  take 
into  consideration  the  (disconnected)  actions  that  are  still  active  and  the  fact  that  not  all  the 
participants  are  known  when  the  protocol  commences. 

'It  would  take  several  rounds  of  messages  from  the  coordinator  to  the  known  participants  until  all  the  par¬ 
ticipants  are  found  and  active  DAs  aborted. 
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The  time-space  diagram  in  Figure  5-1  depicts  the  modified  Two  Phase  Commit  protocol2. 
The  details  are  given  in  Figure  5-2  for  the  coordinator  and  for  the  participants  in  Figures  5-3 
and  5-4.  The  new  protocol  uses  the  first  phase  to  quiesce  active  subactions  and  collect  the 
missing  names  of  participants  and  aborted  (disconnected)  actions.  The  second  phase  is  similar 
to  the  second  phase  of  the  original,  except  that  the  list  of  missing  aborted  subactions  has  to  be 
carried  along. 

The  new  protocol  is  as  follows:  The  coordinator  begins  as  before  by  sending  a  PREPARE 
message  to  all  its  known  participants,  accompanied  by  the  accumulated  list  of  known  aborted 
subactions  ( alist ).  Each  participant  receiving  the  PREPARE  message  would  also  behave  as  in 
the  original  protocol,  unless  it  participated  in  some  disconnected  subaction  of  the  committing 
topaction. 

A  participant  receiving  a  PREPARE  message  will  do  the  following:  Quiesce  (i.e.,  abort) 
all  the  local  active  (disconnected)  subactions  of  that  topaction,  including  both  top  DAs  and 
handler  calls  made  by  (descendants  of)  DAs.  Prepare  the  local  atomic  objects  (see  Section  5.2) 
and  forward  the  PREPARE  message  received  from  the  coordinator  to  the  participants  of  its 
committed  local  DAs.  This  way  the  message  will  get  to  all  the  participants.  Some  participants 
may  get  more  than  one  PREPARE  message,  but  they  can  simply  ignore  all  but  the  first  since 
the  messages  are  identical. 

All  the  participants  that  prepared  successfully  (see  details  in  Section  5.2)  reply  with  an  OK 
message  to  the  coordinator.  A  participant  that  is  unable  to  prepare  (e.g.,  because  it  crashed) 
replies  with  REFUSED  as  in  the  original  protocol.  The  OK  message  is  accompanied  by  two 
lists: 

•  plistd  -  the  list  of  participants  of  committed  DAs,  created  by  combining  the  plists  of  all 
the  committed  local  top  DAs. 

•  alistd  -  the  list  of  aborted  DAs,  composed  of  those  local  top  DAs  that  were  aborted 
during  preparation  and  all  those  in  the  alists  of  the  committed  local  top  DAs. 

The  coordinator  that  receives  OK  messages  from  all  the  participants,  including  those  in  the 

2See  the  original  protocol  in  Figure  2-4  on  page  27 
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%  Simple  version  of  the  coordinator.  No  optimizations. 

%  Some  straightforward  type  specs  are  missing. 

coordinator  =  proc  (topaction:action,  alist:  actionJist,  plist:  siteJist)  returns  (commit.or.abort) 
%  Called  by  the  topaction  to  perform  Two  Phase  Commit 
%  Begin  phase  I. 

for  participants  In  site.list$elements(p!ist)  do 

send..to(participant,“PREPARE’,,topaction,alist) 

end 

alist-rost :  actionJist  :=  actionJist$new() 
msg :  message 

for  participant:site  in  siteJist$e!ements(piist)  do 
msg  :=  receive.from(participant) 
except  when  timeout,  aborted: 

write-tO-stable-Storage(“ABORT”,  topaction,  plist) 
for  participants  in  siteJist$elements(plist)  do 
send.to(participant, “ABORT1, topaction) 
end 

return(abort) 

end 

t  siteJist$add.to(plist,msg.plistjd) 

actionJist$add-to(alist-rest,msg.aiist.d) 

end 

%  End  of  phase  t.  Coordinator  decides  to  commit. 
write.tojstable-storage(“COMMiT’,topaction, plist, aiistjest) 

%  Begin  phase  II. 

for  participants  in  siteJist$e!ements(plist)  do 

send  ^(participant,1 “COMMIT1,  topaction,  alistjest) 

end 

for  participants  in  siteJist$elements(plist)  do  receive.DONE(participant)  end 
write.to_stable_storage(“DONE",  topaction) 

%  End  phase  II. 
return(commit) 
end  coordinator 


Figure  5-2:  Modified  Two  Phase  Commit  -  The  coordinator  part 
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participant-phaseJ  =  proc  (topaction:action,  alist:action_list,  cc:crash.count) 

%  Called  at  a  site  when  a  “PREPARE" message  is  received. 

%  PREPARED  -  list  of  topactions  that  are  locally  prepared.  (Atomic  object  I) 

%  ACTIVE,  COMMITTED -see  Chapter  2  for  detail. 

if  actionJist$is  jn(topaction, PREPARED)  then  return  end  %  Already  received  “PREPARE" 
if  cc  >  LocaLCrash.Count  then  %  Participant  has  crashed  since  visited 
send.to(topaction  .“ABORT) 

return 

end 

action_list$insert(PREPARED, topaction) 
alist-d  :  actionJist  :=  actionJist$new() 
plist-d  :  site-list  :=  siteJist$new() 
for  active-action :action  in  activeJist$elements(ACTIVE)  do 
if  aid$disconnected(active.action)  cand 

aid$descendant(active.action  .topaction)  then 
%  Found  an  active  DA,  abort  it. 
action  Jist$remove(ACTIVE,active-actton) 
if  aid$top-DA(active.action)  then 

actionJist$insert(alist-d,active-action) 

end 

end 

end 

for  committed-action raction  in  committedJist$elements(COMMITTED)  do 
if  action  Jist$ancestorJs-in(alist,committed.action)  then 

committedJist$remove(COMMITTED,committed.action) 

continue 

end 

if  aid$disconnected(committedJaction)  cand 

aid$descendant(committed.action, topaction)  then 
%  Found  a  committed  top  DA,  get  its  alist  and  plist 
actionJist$add.to(alist.d,geLalist(COMMlTTED, committed-action)) 
siteJist$add-to(plist-d,get43iist(COMMITTED1committed^ction)) 
end 
end 

prepare(topaction.alist)  %  see  Section  5.2 
%  Forward  to  participants  of  committed  DAs 
for  participant:site  in  siteJist$e!ements(plist.d)  do 

send.to(participant, “PREPARE”, topaction.alist) 

end 

send-to(topaction,“OK”,plist-d,alist-d) 
end  participant-phaseJ 


Figure  5-3:  Modified  Two  Phase  Commit  -  The  participant  part,  phase  I 
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plistdS 3  ,  can  decide  to  commit  the  topaction.  (In  Chapter  6  we  explore  the  possibility  of  doing 
so  before  all  the  participants  have  replied,  or  even  if  some  refused.)  The  coordinator  knows,  at 
this  point,  about  all  the  participants  and  all  the  aborted  subactions.  It  has  to  process  the  lists 
returned  with  the  OK  replies:  Add  those  new  unknown  participants  to  its  plist,  and  merge  the 
lists  of  aborted  (disconnected)  subactions  (i.e.  alistd)  into  a  new  alist,  alistTest. 

After  making  its  decision,  the  coordinator  can  start  the  second  commit  phase.  As  in  the 
original  protocol,  a  COMMIT  (or  ABORT)  message  is  sent  to  all  the  participants.  Since  the 
original  alist,  which  was  sent  with  the  PREPARE  message,  lacked  the  aborted  disconnected 
actions,  the  missing  actions  must  be  passed  to  the  participants  during  the  second  phase.  The 
participants  that  receive  the  CO MMIT  message  with  the  new  ali$tTest  have  at  this  point  enough 
information  to  make  the  transaction’s  modifications  permanent.  They  reply  with  DONE  once 
done. 

Note  that  the  recursive  case  (i.e.,  nested  DAs)  is  handled  correctly  by  our  scheme.  Par¬ 
ticipants  receiving  forwarded  PREPARE  messages  can  forward  them  again.  The  PREPARE 
message  would  propagate  to  all  the  participants  and  they  would  all  become  known  to  the 
coordinator  before  it  has  to  decide  about  the  transaction’s  fate. 


participant-phase.ll  =  proc  (fate:string,  topaction:action,  alist.rest:actionJist) 
%  Called  at  a  site  when  a  “COM MIT/" ABORT'  message  is  received. 
if  fate  =  “COMMITTED”  then 

commit-prepared(topaction,alist.rest)  %  see  Section  5.2 
send.to(topaction  ."DONE") 
else  %  fate  =  “ABORT' 

abort-prepared(topaction)  %  see  Section  5.2 

end 

actionJist$remove(PREPARED,topaction) 
end  participanLphase.il 


Figure  5-4:  Modified  Two  Phase  Commit  -  The  participant  part,  phase  II 


3Note  the  iteration  semantics  (in  Figure  5-2,  line  marked  with  “f”  )  :  New  sites  from  phstd,  added  to  plist 
would  be  yielded  as  participants  in  the  following  iterations. 
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5.1.1  Optimizations  and  variations 

The  modified  Two  Phase  Commit  protocol  proposed  above  can  be  further  optimized  for  better 
performance: 

•  The  coordinator  can  add  its  plist  to  the  PREPARE  message.  This  way  a  participant  may 
save  on  forwarding  the  message  to  some  of  the  participants  of  the  disconnected  actions  as 
they  are  already  known  by  the  coordinator,  and  return  a  smaller  plistj  to  the  coordinator. 

•  A  participant  may  delay  aborting  some  DAs  after  a  PREPARE  message  is  received  to 
give  them  a  chance  to  commit.  The  delay  period  has  to  be  set  empirically  or  be  given  by 
the  programmer. 

•  A  participant  may  add  its  alistd  to  the  alist  of  the  PREPARE  message  it  forwards  to 
the  participants  in  its  plistj.  This  way  those  participants  would  have  to  write  less  to 
stable  storage  (see  Section  5.2),  but  would  have  to  monitor  all  the  incoming  PREPARE 
messages  for  different  alists. 


5.2  Preparing  an  atomic  object  at  a  site 

Preparing  an  atomic  object  that  was  modified  by  disconnected  actions  is  different  from  the 
regular  case:  The  object’s  site  may  not  have  enough  information  about  the  actions  that  modified 
the  object  during  the  first  commit  phase,  and  therefore  it  has  to  defer  some  decisions  to  the 
second  phase. 

An  atomic  object  that  was  modified  by  an  atomic  transaction  has  an  (stack)  ordered  list  of 
potential  versions.  In  a  model  without  DAs,  a  participant  receives  enough  information  by  the 
time  it  has  to  prepare  to  find  the  topmost  version  that  belongs  to  a  non-aborted  descendant 
of  the  topaction.  The  participant  prepares  by  writing  that  version  to  stable  storage4;  all  the 
other  versions  can  be  discarded.  The  decision  to  abort  or  commit  the  topaction  will  determine 
which  version  will  prevail  -  that  top  one  or  the  base  version. 

1  Writing  a  version  to  stable  storage  includes  also  non-stable  objects  that  are  accessible  from  this  version.  See 
[Oki,  Liskov  &  Scheifler  1985]  and  [Kolodner  89]. 


75 


- 1 

4.2.3.D1.5  I 


4.2.3.01 


4.2.3 


4.2 


4 


-« - TOP 


TOPMOST  REGULAR 


-■ - BOTTOM 


BASE 


Figure  5-5:  An  example  showing  versions  of  an  object  to  be  prepared 


When  DAs  are  used  and  the  above  modified  Two  Phase  Commit  protocol  is  obeyed,  a  more 
complex  scheme  has  to  be  used  to  ensure  that  the  topmost  committed  version  of  an  atomic 
object  will  be  available  when  and  if  the  topaction  decides  to  commit.  A  participant  may  not 
have  all  the  information  it  needs  by  the  time  it  has  to  prepare  to  determine  the  topmost  version 
of  a  committed  subaction. 

Figure  5-5  shows  an  example  of  an  object  at  a  site  at  the  time  when  a  PREPARE  message 
was  received.  Note  that  since  the  locking  rules  enforce  strict  ancestry  order  among  the  versions, 
an  object  modified  by  DAs  must  have  DAs1  versions  above  the  regular  versions.  Therefore  if 
none  of  the  actions  holding  versions  on  the  object  have  an  ancestor  in  the  PREPARE  alist ,  any 
of  the  topmost  regular  version  or  the  DAs’  versions  could  be  the  final  one.  In  the  example,  if  no 
ancestor  of  4.2.3  was  in  the  PREPARE  alist ,  the  preparing  participant  can  not  tell  which  of 
4.2.3,  4.2.3.D1  or  4.2.3. 01.5’s  versions  will  replace  the  base  version  if  the  coordinator  decides 
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% 

prepare  =  proc  (topaction:action,  alist:actionJist) 

%  called  by  participant^hase.l 
% 

prepared.objects :  object-list  :=  object.list$new() 
for  obj:object  in  committedJist$alLobjects(COMMiTTED,topaction)  do 
for  venversion  in  object$versions.top.to±>ottom(obj)  do 
if  action.list$ancestor  jsJn(alist,ver.aid)  then 
object$discard.version(obj,ver) 
elseif  aid$not-disconnected(ver.aid)  then 

object$discard-versionsJbelow.version(obj,ver) 

break 

end 

end 

if  object$have-no.versions(obj)  then  %  Read  only  object 
object$re!ease.alLlocks(obj) 

else 

objectJist$add.object(prepared.objects,obj) 

%  write  remaining  versions  to  stable  storage 
object$save.versions(obj) 

end 

save.record(“PREPARED,,,topaction1preparedjobjects) 
end  prepare 


Figure  5-6:  Preparing  objects  at  the  participant’s  site 


to  commit. 

Preparing  an  object  with  DAs’  versions  requires  writing  all  these  versions  to  stable  storage, 
in  addition  to  the  topmost  regular  version,  unless  some  ancestor  is  known  to  be  aborted.  Fig¬ 
ure  5-6  specifies  the  steps  to  be  taken:  When  a  PREPARE  message  is  received,  the  versions  on 
each  of  the  topaction’s  object  are  scanned  top  to  bottom.  Versions  of  aborted  subactions  are 
discarded,  and  the  remaining  DAs’  versions  and  the  topmost  regular  version,  if  any  exist,  are 
written  to  stable  storage.  All  the  regular  versions  except  the  topmost  one  are  discarded. 

After  all  the  topaction’s  objects  have  been  prepared  successfully,  a  recbrd  is  written  to  stable 
storage  to  mark  that  fact.  From  this  point  on  the  site  is  committed  to  keeping  all  the  prepared 
objects  until  notified  about  the  coordinator’s  final  decision. 

The  participant’s  work  can  be  finished  after  the  coordinator  decision  is  received.  As  in  the 
original  protocol,  when  the  decision  is  to  abort  the  transaction,  all  the  prepared  versions  and 
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% 

abort-prepared  =  proc  (topaclion:action) 
%  called  by  participant43hase.il 
% 


prepared.objects :  object-list  :=  get-record(“PREPARED”,topaction) 
for  objiobject  in  objectJistSelements(prepared-objects)  do 
discard-prepared.material(obj) 
object$release.alLlocks(obj) 

end 

end  abort-prepared 


Figure  5-7:  Abort:  Discarding  prepared  modifications  at  a  participating  site. 


% 

commit-prepared  =  proc  (topaction:action,  al  ist.rest  :action  J  ist) 

%  called  by  participant43hase.ll 
% 

prepared-objects  :  object-list  :=  get.record(“PREPARED”,topaction) 
for  objiobject  in  objectJist$elements(prepared.objects)  do 

for  veriversion  in  object$vsrsions-top.toJbottom(obj)  do 

if  action.list$ancestor  js-NOTJn(alist.rest,ver.aid)  then 
object$replace-base.version.with.ver(obj,ver) 
break 
end 
end 

discard-prepared-material(obj) 

object$release.all-locks(obj) 

end 

end  commit.prepared 


Figure  5-8:  Commit:  Making  prepared  modifications  permanent  at  a  participating  site. 


locks  arc  discarded  (Figure  5-7).  When  the  coordinator  decides  to  commit,  the  participant  has 
to  finish  the  work  from  the  first  phase  (i.c.,  identify  versions  of  aborted  subnotions)  before  it  can 
replace  the  base  version  with  the  correct  new  version.  As  detailed  in  Figure  .r)-8,  the  participant 
scans,  for  every  object,  the  prepared  versions  top  to  bottom.  A  version  found  to  belong  to  an 
aborted  (disconnected)  subaction  is  skipped;  the  first  to  belong  to  non-aborted  .subaction  (if 
any)  replaces  the  base  version. 
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Chapter  6 


Fault  tolerant  Two  Phase  Commit 


In  this  chapter  we  explore  the  possibility  of  successfully  committing  a  transaction  in  spite  of 
some  failures.  That  is,  we  want  the  Two  Phase  Commit  coordinator  to  be  able  to  ignore  several 
participants  that  failed  and  commit  the  transaction  without  them. 

The  ability  to  ignore  some  participants  gives  more  fault  tolerance  to  our  model.  In  the 
current  Two  Phase  Commit  protocol  ([Gray  1978]),  a  failure  of  any  participant  either  causes 
the  transaction  to  abort  or  delays  its  commit. 

The  basic  idea  behind  this  chapter  is  that  some  DAonly  participants,  those  that  were  used 
only  by  DAs1,  are  dispensable.  Since  the  effects  created  by  Disconnected  Actions  are  not 
essential  to  their  creators’  success,  the  transaction  can  commit  without  some  of  its  DAonly 
participants. 

A  simple  example  is  the  replication  application  from  Section  1.1,  depicted  in  Figure  6-1. 
The  topaction  A  ran  at  Ga  and  created  five  DAs  (A.D  1, ..,  A.Db)  to  update  the  object  0  that 
is  replicated  at  Gdi,  -,Gd5-  After  a  majority  of  the  DAs  (i.e.,  three)  committed,  A  continued 
running  until  it  decided  to  commit.  Meantime  the  remaining  (two)  DAs  also  committed  to  A. 
The  transaction  A  had  not  modified  any  objects  besides  the  replicas  Oi,..,Os. 

The  coordinator  of  A’s  Two  Phase  Commit  can  commit  A  even  if  one  or  two  of  the  par¬ 
ticipants  (Gui,..,  G'ijs)  failed  to  reply  in  time  to  the  PREPARE  message  or  replied  with  RE¬ 
FUSED !  All  the  coordinator  has  to  guarantee  is  that  at  least  a  majority  of  the  participants  did 

JNote  that  these  are  exactly  those  participants  of  the  plista s  that  are  not  in  the  coordinator’s  initial  plist. 
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Figure  6-1:  The  replication  example:  Up  to  2  sites  can  be  ignored  during  Two  Phase  Commit. 

manage  to  prepare,  hence  ensuring  consistency  of  the  replicated  object. 

The  general  case  is  not  as  straightforward  as  the  previous  example.  In  Section  6.1  we 
study  the  implications  of  ignoring  DAonly  participants.  In  Sections  6.2  and  6.3  we  propose  two 
olutions:  introduction  of  a  Disconnected  Nested  Topaction  and  modification  of  the  existing 
model.  In  Section  6.4  we  summarize  and  evaluate  the  merits  of  the  ability  to  ignore  participants. 

6.1  Problems  with  Ignoring  Disconnected  Participants 

Though  in  some  special  cases  ignoring  a  participant  has  the  effect  of  aborting  a  DA,  the  general 
case  is  far  more  complicated.  Figure  6-2  demonstrates  two  possible  problems:  A  DA  affecting 
more  than  one  site  and  a  subaction  depending  on  a  DA. 
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Figure  6-2:  Problems  with  ignoring  DAonly  participants. 


The  problem  of  a  DA  creating  effects  at  several  participants  is  that  ignoring  one  failed  site 
is  not  equivalent  to  aborting  the  DAs  that  ran  there.  The  problem  is  not  trivial  because  the 
coordinator  does  not  necessarily  know  which  DAs  ran  at  a  certain  site.  Had  it  known,  it  could 
have  added  those  DAs  to  its  alistre3t,  effectively  aborting  those  DAs. 

For  example,  assume  that  the  DA  AA.D1  has  modified  two  objects,  0\  at  site  G\  and  O2  at 
Gdu  preserving  some  invariant  between  the  two  objects  (e.g.  value  equality).  An  inconsistent 
state  will  be  created  if  the  coordinator  ignores  Gdi  (e.g.,  because  Gdi  crashed)  and  decides 
to  commit  A,  making  A’s  effects  permanent  at  all  the  other  participants.  Had  the  coordinator 
known  that  the  DA  A.1.D1.1  ran  at  Gdu  it  could  have  aborted  A.1.D1.1  by  adding  it  to  its 
alistTtat ,  preventing  the  problem  of  possible  inconsistent  state. 

With  the  modified  Two  Phase  Commit  protocol,  discussed  in  Chapter  5,  the  coordinator  can 
not  tell  which  DAs  ran  at  a  particular  (  DAonly  )  participant2.  This  information  is  distributed 
among  the  participants  where  the  DAs  were  created,  and  is  discarded  by  the  end  of  the  first 
commit  phase. 

The  second  problem  concerns  the  dependency  of  other  subactions  on  a  commit  of  a  DA. 
Aborting  this  DA  would  nullify  these  subactions  and  necessitate  aborting  them,  which  may  not 
be  possible  when  the  dependent  subactions  are  not  disconnected. 

For  example,  subaction  A. 3  in  Figure  6-2  has  read  the  object  O3  at  site  G3,  which  was 
modified  by  A.2.DI.I  that  committed.  Ignoring  Gdz,  which  implies  aborting  A.2.D1,  would 
force  the  whole  transaction  to  abort  since  the  effects  of  A. 3,  a  regular  subaction,  can  not  be 
undone.  In  the  case  where  a  subaction  like  A. 3  is  disconnected,  cascading  aborts  can  take  place 
and  allow  the  transaction  to  commit,  but  the  coordinator  has  to  know  about  the  dependency. 

6.2  First  solution:  Disconnected  Nested  Topactions 

The  two  problems  mentioned  above  can  be  avoided  with  a  relatively  simple  mechanism:  The 
Disconnected  nested  topaction.  This  new  feature  is  basically  a  nested  topaction  that  executes 
asynchronously  with  its  creator,  just  like  a  DA,  but  unlike  a  nested  topaction,  the  discon¬ 
nected  nested  topaction  ( DNTA )  can  commit  only  if  the  topaction  of  its  creator  (i.e.  the  main 

2 Assume  that,  in  our  example,  other  DAs  were  also  created  at  Gdi- 
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topaction)  commits  and  none  of  its  ancestors  abort. 

The  DNTA  solves  the  problem  of  other  subactions  that  observe  its  effects  by  simply  prevent¬ 
ing  it.  Like  a  regular  nested  topaction,  the  DNTA  is  not  considered  a  descendant  of  its  creator 
for  the  purpose  of  lock  inheritance  or  acquisition.  The  DNTA  can  not  get  a  write-lock  on  an 
object  locked  by  a  descendant  of  the  main  topaction,  or  read  an  object  that  has  a  write-lock 
on  behalf  of  the  main  topaction,  and  the  same  applies  in  the  other  direction. 

The  problem  of  a  DA  affecting  several  sites  is  solved  by  having  centralized  control  over  the 
whole  DNTA  at  its  coordinator.  The  DNTA’s  coordinator  isolates  the  main  coordinator  from 
the  DNTA’s  Two  Phase  Commit;  it  knows  of  all  the  DNTA’s  participants  and  when  one  has  to 
be  ignored,  it  can  abort  the  whole  DNTA. 

6.2.1  Implementation  of  the  DNTA 

The  Disconnected  Nested  Topaction  starts  as  a  DA,  runs  as  a  nested  topaction  and  ends  up 
like  a  DA  again.  This  way  a  minimal  change  to  the  existing  implementation  is  required.  The 
DNTA  is  a  nested  topaction  when  needed  to  cope  with  the  previous  two  problems,  and  is  a  DA 
for  the  purpose  of  committing  it  as  part  of  the  main  topaction. 

The  implementation  of  the  DNTA  is  as  follows: 

•  The  DNTA  is  created  as  a  top  DA3,  and  is  added  to  the  list  of  active  local  actions. 

•  For  lock  propagation  purposes  the  DNTA  is  considered  a  nested  topaction. 

•  The  nested  topaction  of  the  DNTA  runs  and  commits  as  usual  until  it  finishes  phase  one 
of  its  Two  Phase  Commit  (see  Figure  5-2).  If  its  coordinator  decided  to  commit,  it  does 
not  write  to  stable  storage  or  send  any  messages,  but  instead  commits  like  a  top  DA  and 
keeps  (in  its  COMMITTED  entry)  its  plist  (and  its  alistreal  if  it  had  nested  DAs);  there 
is  no  need  to  keep  the  list  of  objects  since  they  are  all  prepared  at  their  sites  (those  in 
the  plist). 

Note  that  getting  to  this  point  means  that  none  of  the  DNTA’s  participants  aborted,  and 
they  are  all  at  the  “prepared”  state.  Had  any  participant  failed,  the  coordinator  of  the 

3The  DTNA  would  need  a  special  tag  in  its  AID  if  DTNAs  and  DAs  are  to  coexist  in  the  system. 


84 


nested  topaction  would  have  had  to  abort  it,  notify  its  participants  and  remove  all  traces 
of  it  from  the  local  site.  Note  that  a  crash  of  the  local  site  (before  preparing  for  the  main 
topaction)  would  have  the  same  effect  as  aborting  except  that  messages  are  not  sent  to 
the  participants.  In  this  case  the  participants  would  have  to  initiate  queries  to  find  that 
their  DNTA  aborted. 

•  The  modified  Two  Phase  Commit  (described  in  Chapter  5)  of  the  main  topaction  would 
finish  up  the  DNTAs’  work  correctly.  In  the  prepare  phase,  those  DNTAs  with  aborted 
ancestors  would  be  aborted  (see  Figure  5-34),  and  the  plists  (and  alists  in  the  recursive 
case)  of  the  committed  DNTAs  would  be  added  to  the  local  plistd  (and  alistd).  This  way 
the  second  phase  of  the  DNTA’s  commit  would  be  handled  by  the  second  commit  phase 
of  the  main  topaction. 

A  small  semantic  change  is  needed,  though,  in  the  second  cl  nmit  phase.  The  coordina¬ 
tor’s  decision  message  should  apply  not  only  to  its  transaction  but  to  its  (separately  kept) 
DNTAs  at  the  participants.  The  algorithms  in  Chapter  5  need  not  be  changed,  except 
that  the  “get_record”  operation,  used  in  Figures  5-8  and  5-7,  should  be  specified  to  also 
retrieve  the  records  of  the  transaction’s  DNTAs. 

6.2.2  Evaluation  of  DNTAs 

The  main  advantage  of  the  Disconnected  Nested  Topaction  is  its  simple  implementation.  As 
seen  above,  only  three  minor  changes  have  to  be  made  to  our  model  in  order  to  have  DNTAs 
in  it. 

The  main  disadvantage  of  DNTAs  is  their  restricted  functionality.  They  can  not  inherit 
(or  release)  locks  from  (or  to)  their  creators.  Programmers  have  to  be  very  careful  to  avoid 
deadlocks  with  the  use  of  DNTAs.  For  example,  a  transaction  trying  to  read  an  object  that 
was  modified  by  one  of  its  DNTAs  would  wait  forever. 

Quorum-sets  can  be  done  with  DNTAs  instead  of  DAs.  They  would  take  longer  to  run 
because  a  prepare  phase  is  involved,  but  any  committed  DNTA  is  guaranteed  to  survive  crashes 
of  its  participants  (except  its  coordinator’s  site). 

4  At  that  stage  we  can  add  sending  abort  messages  to  the  participants  of  the  aborted  DNTAs. 
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Figure  6-3:  An  abstract  view  of  a  possible  maplist 


Note  that,  though  restricting  lock  exchange,  the  DNTA  releases  its  read-locks  after  finishing 
its  first  commit  phase,  therefore  enabling  its  main  topaction  to  get  write  locks  on  objects  that 
were  only  read  by  DNTAs. 


6.3  Second  solution:  Modify  the  commit  mechanism 

Site  ignoring  can  also  be  achieved  by  some  modifications  to  the  commit  mechanism.  The 
modified  mechanism  enables  the  coordinator  to  deduce  which  DAs  ran  at  a  certain  site  and  on 
which  (DAonly)  sites  committed  regular  subactions  depend. 

To  provide  the  coordinator  with  the  mapping  from  a  site  to  the  names  of  the  DAs  that  ran 
there,  the  participants  have  to  send  this  information  to  the  coordinator.  Each  participant  has 
the  participants  lists  for  the  committed  top  DAs.  I.e.,  the  mapping  F: 

V  DA  €  COMMITTED,  F  :  DA  ■ — ■>  {participant, participants,  •  •  •} 

By  combining  all  the  maps  it  receives,  the  coordinator  has  the  inverse  mapping  F~r\ 

V  Participant  €  plist,  F~l  :  Participant  — >  {DA\,  DAs,  •  •  •} 

and  can  abort  all  the  DAs  that  ran  at  the  sites  it  ignored  by  adding  them  to  its  alistrest. 

The  implementation  is  simple;  each  participant  has  to  replace  its  plistd,  returned  with  its 
OK  reply,  with  a  maplist,  a  list  of  committed  local  top  DAs,  each  DA  with  its  plist.  An  example 
for  an  abstract  maplist  is  given  in  Figure  6-3.  The  size  of  the  maplist  will  be  proportional  to 
the  number  of  DAs,  but  can  be  reduced  if  we  use  the  optimization  of  adding  the  coordinator 
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plist  to  the  PREPARE  message;  a  DA  that  has  all  its  participants  in  the  PREPARE  plist  need 
not  be  reported  since  none  of  its  participants  can  be  ignored. 

Another  problem  to  handle  is  DAs  of  quorum-sets.  Suppose  some  action  created  a  set  of 
n  DAs,  required  that  at  least  q  commit  and  m  eventually  did  (q  <  m  <  n ).  The  coordinator 
should  not  be  allowed  to  violate  this  requirement.  The  participant  should  pass  this  information 
in  the  maplist  with  its  OK  reply,  and  the  coordinator  should  check  that  the  minimum  for  each 
set  is  not  violated  by  ignoring  some  participants5.  The  example  in  Figure  6-3  shows  two  such 
sets;  at  least  two  DAs  out  of  {DA\,  DAi,  DA3}  and  three  out  of  {DA5, . . . ,  DA$]  must  commit. 

Another  way  to  abort  DAs  that  ran  at  an  ignored  site  is  by  letting  the  participants  abort 
DAs.  The  coordinator  wjll  decide  on  ignoring  a  participant  and  tell  the  other  participants  about 
it.  Each  participant  will  use  its  local  mapping  to  deduce  which.  DAs  should  be  aborted,  and 
notify  the  sites  in  then  plists.  This  way  less  information  has  to  be  sent  to  the  coordinator,  but 
handling  DAs  of  quorum  sets  would  be  more  complicated. 

The  problem  of  a  regular  subaction  S  that  depends  on  the  commit  of  a  DA  D  can  be  solved 
by  letting  the  coordinator  know  about  S' s  dependency  on  D's  DAonly  participants,  hence 
preventing  the  coordinator  from  ignoring  DAonly  participants  upon  which  committed  regular 
subactions  depend.  This  knowledge  can  be  easily  obtained  from  the  mechanism  that  detects 
Crash  Orphans  ([Liskov,  et  al.  1987a]).  If  such  detection  mechanism  is  not  implemented,  a 
simpler  mechanism  can  be  devised  (see  Section  6.3.1)  to  acquire  the  dependency  information. 

The  Crash  Orphans  mechanism  works  by  maintaining,  for  every  action,  a  list  of  sites  on 
which  the  action  depends  ( dlist ).  Besides  regular  sites,  the  dlist  contains  (possibly  DAonly) 
sites  that  were  used  by  DAs  upon  whom  the  action  depends,  as  described  in  Section  3.4.  In 
the  above  example,  when  A. 3  tries  to  get  the  lock  on  O3,  the  query  to  G2  regarding  A.2.DVs 
fate  would  return  with  A.2.Dl's  dlist  (if  it  committed),  which  would  be  merged  to  A.3’s  dlist. 
If  (some  ancestor  of)  A. 3  aborted,  the  dlist  would  be  discarded  and  not  propagated  up  to  the 
topaction. 

All  that  has  to  be  done  to  prevent  the  dependency  problem  is  to  give  the  topaction ’s  dlist 
to  the  coordinator.  The  coordinatoi  is  not  allowed  to  ignore  any  member  of  this  dlist.  This 

5The  participant  can  find  this  information  in  the  entry  of  the  concurrent-set  action  in  COMMITTED  (see 
Section  3.2.2) 
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solution  is  correct  because  all  the  DAonly  sites  upon  which  a  regular  action  depends  must  be  in 
its  dlist,  and  when  that  action  commits  those  sites  are  inserted  into  its  ancestor’s  dlist.  (Note 
that  dlists  of  aborted  subactions  are  discarded.)  To  reduce  the  expected  size  of  the  maplists 
that  accompany  OK  replies,  the  coordinator  can  add  the  topaction’s  dlist  to  its  PREPARE 
message  (replacing  the  plist  in  the  optimization  mentioned  above). 

6.3.1  A  new  dependency  detection  mechanism 

If  the  system  does  not  use  the  Crash  Orphan  Detection  mechanism,  a  smaller  mechanism  can 
be  devised  to  serve  our  need.  The  new  dependency  detection  mechanism  is  basically  a  subset 
of  the  Crash  Orphan  Detection  mechanism,  though  much  simpler  because  it  does  not  need  to 
catch  crash  orphans  “on  the  fly”,  hence  maps  of  crashcount  need  not  be  maintained  and  sent 
with  every  message.  Another  simplification  comes  from  not  using  dlists;  we  need  not  keep  a 
total  dependency  relation,  only  dependency  on  sites  used  by  DAs,  and  can  use  the  actions’ 
plists  to  convey  this  information  to  the  coordinator. 

The  idea  behind  the  new  mechanism  is  that  when  any  action  A  acquires  a  lock  on  an  object 
0,  where  committed  DAs  had  versions  (i.e.,  write  locks),  the  plists  of  those  DAs  are  to  be 
merged  into  As  plist.  This  way  the  coordinator  would  end  up  with  some  DAonly  participants 
in  it  initial  plist,  which  are  those  DAonly  participants  upon  which  committed  regular  subactions 
depend,  and  would  not  consider  them  as  DAonly  (i.e.,  ignorable). 

This  new  use  of  the  plist  may  change  its  semantics  a  bit,  since  it  may  contain  sites  that 
were  not  visited  by  its  action’s  descendants,  but  will  cause  no  problem  because  those  semantics 
are  only  required  from  the  plist the  plist  that  the  coordinator  has  by  the  time  it  has  to 
make  the  commit  decision,  and  the  following  invariant  holds  (for  a  topaction  T ): 

faction  €  Descendants(T )  :  plist{action )  C  newplist(action )  C  dlist( action)  C  plist jinai(T) 

That  is,  the  new  plist  may  have  more  sites  than  the  original,  but  no  more  than  the  dlist  that  is 
used  when  the  Crash  Orphan  Detection  mechanism  is  used,  and  no  foreign  sites  are  introduced 
into  the  coordinator’s  plist. 

For  completeness,  the  details  of  the  new  mechanism  are  given  below  (with  references  to  the 
example  in  Figure  6-2  for  clarity).  Note  that  this  is  basically  the  same  algorithm  as  was  given 
in  Section  3.4,  only  the  plist  is  used  instead  of  the  dlist. 
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•  When  a  query  about  a  committed  top  DA  is  replied  with  “committed”,  the  DA’s  plist 
accompanies  the  reply.  (Such  a  query  must  bo  sent  from  G3  to  G 2  before  A. 3  can  have  a 
conflicting  lock  on  O3.) 

•  An  (initially  empty)  plist  is  associated  with  every  object  version. 

•  A  site  receiving  a  query  reply  with  a  DA’s  plist  would  do  the  lock  propagation  as  usual, 
but  when  a  version  is  being  inherited  (e.g.  A.2.Z>l.l’s  — >  A),  the  plist  received  is  merged 
into  the  version’s  plist.  If  the  inheritor  (A)  already  had  a  version,  the  other  two  plists 
(A.2.jD1.1’s  and  the  one  received)  are  merged  into  its  (A’s)  plist. 

•  When  an  action  gets  a  (read  or  write)  lock  on  an  object,  all  the  plists  on  the  version  read 
are  merged  into  the  action’s  plist. 

6.3.2  Evaluation 

Unlike  the  DNTA  solution,  the  one  above  is  transparent  to  programmers;  it  puts  no  program¬ 
ming  restrictions.  When  fault  tolerance  during  the  Two  Phase  Commit  process  is  important, 
dependencies  among  subactions  (such  as  a  regular  subaction  reading  the  effects  of  many  DAs) 
should  be  avoided. 

This  solution  also  improves  the  performance  of  the  Two  Phase  Commit  somewhat  by  making 
more  participants  known  to  the  coordinator  before  it  begins  the  first  phase. 

The  disadvantage  of  this  solution  is  its  performance  cost.  The  maps  that  are  sent  with  the 
OK  reply  may  be  large,  and  the  Orphan  Detection  mechanism  is  expensive  (see  [Nguyen  1988]). 
Without  Orphan  Detection,  our  new  dependency  detection  mechanism,  which  also  has  some 
cost,  has  to  be  implemented. 

6.4  Summary 

In  this  chapter  we  presented  the  idea  of  completing  the  Two  Phase  Commit  process  successfully 
in  spite  of  failures.  This  new  ability  is  derived  from  the  semantics  of  disconnected  actions, 
namely  doing  work  that  is  not  necessary  for  the  correctness  of  their  creators 
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In  application!;  like  replication  using  majorities,  the  semantics  of  a  DA  may  change  as  it 
commits  and  becomes  part  of  the  quorum,  thus  doing  essential  work  like  an’  other  regular 
subaction.  Nevertheless,  its  old  semantics  can  be  applied  again  during  the  Two  Phase  Commit 
process  to  tolerate  failures  if  more  than  the  minimal  quorum  has  eventually  committed. 

The  ability  to  ignore  several  participants  can  be  very  useful  in  large  scale  distributed  sys¬ 
tems.  A  system  like  The  ClearingHouse  ([Oppen  &  Dalai  1981])  maintains  a  naming  database 
over  more  than  a  thousand  sites,  and  the  likelihood  of  a  failure  occurring  during  the  lifetime 
of  a  transaction  can  not  be  overlooked.  The  ClearingHouse  solved  this  problem  by  not  using 
atomic  transactions.  With  the  scheme  presented  here,  atomic  transactions  can  be  used  in  large 
scale  distributed  systems. 

Another  benefit  from  the  new  ability  is  shorter  timeout  periods.  In  the  original  model, 
timeouts  were  used  by  the  Two  Phase  Commit  protocol  to  decide  that  aborting  the  transaction 
(and  possibly  retrying)  is  more  practical  than  waiting  for  a  (failed)  participant.  Those  timeout 
'eriods  had  to  take  into  account  the  worst  possible  case:  Longest  message  round  trip  time, 

'est  processing  time  and  longest  disk  write  time.  Practical  timeout  periods  are  order  of 
magnitudes  longer  than  the  average  case.  With  the  ability  to  ignore  participants,  shorter 
timeout  periods  can  be  used  as  a  failed  participant  does  not  necessarily  means  an  aborted 
transaction. 

A  point  that  was  hardly  mentioned  in  the  above  chapter  is  recursively  created  DAs.  It  seem 
to  us  that  all  the  work  presented  here  can  be  generalized  in  a  straightforward  manner  to  include 
nested  DAs;  DTNAs  can  be  nested,  and  in  the  second  solution,  the  coordinator’s  handling  of 
nested  DAs  is  similar  to  handling  non-nested  ones. 

Two  possible  implementations  were  proposed:  Disconnected  Nested  Topaction  and  a  modi¬ 
fied  commit  mechanism.  The  trade-off  is  between  simplicity  of  implementation  and  functional¬ 
ity.  We  tend  to  prefer  the  second  solution  as  we  do  not  trust  the  skills  of  programmers,  who  can 
easily  misuse  DNTAs.  The  second  solution  is  also  not  difficult  to  implement  when  the  system 
uses  the  Crash  Orphan  Detection  mechanism. 
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Chapter  7 


Conclusion 


Below  we  summarize  our  work  and  propose  directions  for  further  research. 

7.1  Summary 

The  ability  to  perform  work  on  behalf  of  a  nested  action  in  parallel  with  its  transaction  can 
enhance  a  nested  atomic  action  system  to  better  exploit  potential  parallelism.  The  disconnected 
action,  proposed  in  this  thesis,  provides  a  methodical  way  to  perform  such  an  activity. 

Disconnected  actions  (DAs)  can  be  used  to  perform  benevolent  side-effects  within  the  action. 
Unlike  a  nested  topaction,  which  creates  side-effects  independent  of  its  creator,  modifications 
made  by  DAs  live  within  the  scope  of  their  creator;  when  an  action  aborts,  all  the  modifications 
made  by  its  disconnected  (and  regular)  descendants  are  undone.  Also,  unlike  a  nested  topaction, 
the  DA  can  observe  effects  that  were  created  by  its  transaction  (prior  to  the  creation  on  that 
DA)  and  execute  in  parallel  with  its  transaction. 

We  described  the  current  model  of  computation  that  is  used  by  the  Argus  system;  this 
model  was  the  basis  to  our  work.  We  made  some  small  changes  to  the  current  model  that 
helped  us  reason  about  our  work.  We  separated  the  implementation  of  locks  into  an  unordered 
set  of  read-locks  and  a  stack  of  write-locks  (versions)  ordered  in  ancestry  order,  we  redesigned 
the  structure  of  the  action  identifier  (AID)  to  reflect  accurately  the  action’s  position  in  its 
transaction’s  action-tree,  we  introduced  the  concurrent-set  action  to  help  with  problems  special 
to  concurrent  subactions,  and  we  stated  clearly  the  way  dependency  lists  ( dlists )  should  be 
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handled  for  orphan  detection. 

In  Chapter  3  we  described  the  role,  use  and  implementation  of  DAs  in  our  new  model.  We 
showed  how  the  lock  query  mechanism  and  the  orphan  detection  mechanism  can  be  modified 
to  work  correctly  when  DAs  are  used. 

We  proposed,  in  Chapter  4,  a  technique  to  serialize  disconnected  actions.  For  serial  DAs, 
we  used  timestamps,  carried  with  each  action  and  left  on  every  lock,  to  tell  a  DA  which  effects 
were  created  prior  to  its  creation  and  which  are  not;  in  this  way  we  prevent  non-serializable 
executions.  Timestamps  can  not  serialize  DAs  that  were  created  by  concurrent  subactions, 
however,  since  they  get  serialized  dynamically;  therefore  we  introduced  a  new  mechanism  of 
conflict  tables  (CTs)  for  them.  The  CT  mechanism  keeps  a  table  at  the  concurrent-set  action’s 
site  to  monitor  the  serialization  of  the  concurrent  subactions  explicitly.  We  also  introduced 
fingerprint  locks  to  help  retain  access  information  at  the  object. 

Our  serialization  technique  is  an  ad  hoc  serialization  method,  a  hybrid  of  timestamping- 
based  protocol  ([Reed  1978,  Reed  1983]),  locking-based  protocol  for  nested  atomic  actions 
([Moss  1981])  and  an  explicit  serialization  control  method. 

In  Chapter  5  we  described  in  detail  how  the  two-phase  commit  protocol,  which  is  used  by 
the  current  model  to  commit  the  top-level  action,  is  modified  to  cope  with  problems  introduced 
by  DAs:  DAs  may  still  be  active  when  their  topaction  commits,  and  the  sites  they  used  and 
the  subactions  they  aborted  may  not  be  known  to  the  protocol’s  coordinator  when  it  begins. 

In  Chapter  6  we  introduced  a  novel  idea  of  committing  transactions  in  spite  of  some  failures. 
The  idea  derives  from  the  semantics  of  DAs;  DAs  basically  do  functionally  redundant  work.  We 
proposed  two  ways  in  which  some  sites  that  were  used  only  by  DAs  can  be  ignored,  if  necessary, 
during  the  commit  process. 

We  feel  that  we  achitvtd  our  design  goals:  minimizing  the  change  to  the  design  of  the 
current  model  and  its  performance,  and  ensuring  that  disconnected  actions  have  a  fair  chance 
of  success.  When  DAs  are  not  used,  the  new  model  does  not  send  any  extra  messages,  does 
very  little  additional  local  processing,  and  requires  little  additional  storage;  when  DAs  are  used, 
they  are  almost  as  likely  to  succeed  as  regular  subactions. 
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7.2  Future  work 


Below  we  present  several  directions  in  which  more  work  can  be  done: 

7.2.1  Use  of  an  Explicit  Approach  to  provide  Atomicity 

Though  disconnected  actions  were  initially  introduced  into  the  current  model  in  order  to  per¬ 
form  functionally  redundant  work,  they  can  also  be  used  to  do  work  that  is  essential  to  their 
transaction.  One  example  is  their  use  for  quorum-sets,  which  is  described  in  the  thesis.  An¬ 
other  potential  “non- redundant”  use  of  DAs  can  be  achieved  if  the  model  is  enhanced  to  provide 
programmers  with  explicit  tools  for  implementing  atomic  types,  as  described  in  [Weihl  1984]. 

As  an  example  for  the  use  of  the  explicit  approach,  consider  an  operation  OP  on  some 
abstract  object  that  is  implemented,  as  described  earlier  in  this  thesis,  as  two  operations:  the 
first  (OPi)  does  the  real  work  (reads  or  modifies),  and  the  second  ( OPda )  runs  as  a  DA  and 
improves  the  representation  of  the  object.  As  DAs  were  described  in  our  model,  they  are  not 
guaranteed  to  succeed;  therefore  a  subsequent  use  of  OPi  may  find  that  the  previous  OPda  is 
yet  not  done,  and  OP\  would  do  the  necessary  work  itself,  probably  causing  OPda  to  abort. 

The  implementation  can  be  improved  if,  using  the  explicit  approach,  the  second  use  of  OPi 
could  tell  that  the  prior  OPda  is  not  yet  done,  thus  requiring  a  delay  of  the  second  OP\  for  a 
short  period  until  the  first  OPda  commits.  This  would  result  in  a  better  performance  than  of 
aborting  the  first  OPda  and  starting  its  work  all  over  again. 

7.2.2  Use  a  Reed-like  method  for  serial  DAs 

Our  use  of  timestamps  for  serializing  DAs  was  only  a  simple  addition  to  the  existing  locking- 
based  mechanism;  DAs  found  to  access  the  object  out  of  order  were  aborted,  except  for  the 
case  of  multiple  readers.  In  Reed’s  scheme  ([Reed  1978]),  many  versions  are  kept  for  an  object, 
ordered  in  their  timestamps  order;  thus  the  conflicts  of  multiple  writers,  or  a  reader  that 
reads  after  a  later  writer  wrote,  can  be  sorted  out  without  aborting  by  accessing  an  earlier 
version,  rather  than  the  most  recent  one.  A  small  improvement  to  Reed’o  scheme  is  proposed 
in  [Aspnes,  et  al.  1988];  it  keeps  all  the  read-timestamps,  not  only  the  maximal  as  docs  Reed, 
and  subsequent  aborts  allow  for  a  reduction  in  the  value  of  the  maximal  read- timestamp. 


93 


We  can  have  a  different  scheme  for  serial  DAs,  which  is  similar  to  Reed’s.  Note  that  our 
locks  (and  versions)  resemble  Reed’s  multiple-version  scheme.  It  seems  that  the  cases  of  multiple 
writers  and  a  late  reader  could  be  sometimes  solved  without  the  need  to  abort. 

7.2.3  Use  other  Deadlock  Detection  methods  for  Concurrent  DAs 

As  stated  before,  we  find  the  problem  of  deadlock  detection  and  serialization  analogous;  a  cycle 
in  the  wait-for  graph  means  a  deadlock,  and  a  cycle  in  our  serialized-after  conflict  table  means 
non-serializable  execution. 

Most  deadlock  detection  methods  for  distributed  systems  build  some  variation  of  a  wait-for 
graph  in  either  a  centralized  or  distributed  manner.  The  centralized  ones  (including  hierar¬ 
chical  methods,  see  [Menasce  &  Muntz  1979])  resemble  our  constraint  table  method;  they  use 
a  distinguished  node  in  which  the  wait-for  graph  is  built.  The  distributed  protocols  (e.g., 
[Menasce  &  Muntz  1979,  Moss  1981,  Obermarck])  create  the  wait-for  graph,  whole  or  in  part, 
at  some  node  where  a  local  transaction  suspects  that  it  is  deadlocked.  The  distributed  proto¬ 
cols  are  also  known  as  edge-chasing  ones  because  they  try  to  detect  a  cycle  by  following  the 
graph’s  edges,  often  requiring  messages  to  be  sent  between  nodes  that  have  transactions  along 
the  wait-for  path. 

We  have  chosen  a  centralized  method  to  prevent  non-serializable  execution  of  concurrent 
subactions  that  use  DAs  because  it  was  simpler  and  seemed  to  require  less  additional  messages. 
We  believe  that  a  distributed  method,  similar  to  some  distributed  deadlock  detection  protocol, 
can  also  be  devised. 

7.2.4  Optimistic  Model 

Work  can  be  done  on  using  disconnected  actions  in  a  model  that  uses  optimistic  concurrency 
control,  like  the  one  in  [Gruber  1989].  The  optimistic  approach  allows  atomic  action  to  execute 
without  synchronization  (e.g.,  block  when  another  action  uses  a  needed  object),  and  rely  on 
commit-time  validation  to  ensure  serialization  of  the  actions.  Optimistic  methods  are  useful 
when  things  are  not  likely  to  “go  wrong”,  since  they  pay  a  higher  penalty  (than  pessimistic 
ones)  if  things  do  go  wrong. 
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A  quasi-optimistic1  method  can  be  combined  with  our  pessimistic  one  to  handle  the  case  of 
quorum-sets.  Our  technique  of  starting  up  some  N  D  ■  -  concurrently,  while  only  a  quorum  of 
M  ( M  <  N )  needs  to  commit,  has  a  high  success  probability.  This  technique  can  be  speeded 
up  with  an  optimistic  model,  since  it  is  unlikely  to  fail;  the  creator  of  the  quorum-set  need  not 
be  blocked  until  the  quorum  commits,  but  can  continue,  and  be  undone  if  less  than  the  quorum 
committed. 

7.2.5  Claiming  DAs 

DAs,  as  presented  in  our  work,  are  synchronized  with  their  transaction  only  when  the  latter 
commits.  Therefore  DAs  can  not  return  replies  to  their  transaction  like  handler  calls  do.  A 
linguistic  support  for  asynchronous  work  was  proposed  in  [Liskov  &  Shrira  1988]  with  the  intro¬ 
duction  of  promises  into  the  language.  A  promise  is  returned  to  the  caller  when  an  asynchronous 
activity  begins,  and  can  be  used  by  the  program  to  reclaim  the  results  of  that  activity  later. 

More  work  can  be  done  to  try  and  combine  the  two  methods;  a  promise  can  be  supported 
b>  our  model  by  creating  a  DA  to  perform  the  asynchronous  activity,  and  maintaining  enough 
information  in  the  implementation  of  the  promise  (e.g.,  the  DA’s  AID)  to  track  the  relevant 
DA  and  find  out  about  its  fate  and  its  promised  reply. 

7.2.6  Other  Ideas 

Some  work  can  be  done  to  finish  up  working  out  the  details  of  the  points  we  ignored  in  the 
description  of  the  conflict  table  mechanism:  the  handling  of  aborts,  discarding  of  obsolete 
read  FP  locks  and  not  delaying  regular  actions.  The  details  of  extending  our  algorithms  and 
protocols  to  handle  nested  DAs  need  to  be  filled  in.  Similar  to  the  extension  of  timestamping, 
the  rest  of  the  work  is  likely  to  require  only  straightforward  extension. 


'This  method  is  not  fully  optimistic  because  locking  is  still  used. 
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Appendix  A 


New  implementation  of  the  AID 


The  following  possible  implementation  of  an  AID  is  compatible  with  its  use  in  this  thesis: 

%  Structure: 

%  AID  ->  NEW-GUARD<num>TOP-ACTION<numxbody> 

%  <bcdy>  ->  <aid.tagxnumxbody>  \  epsilon 

% 

aid  =  cluster  is  new,  make-si  ibaction,  make.nestedtop,  make-hcall,  make.DA,  LCA, 
make-concurrent-set-action,  descendant,  non-descendant,  proper-descendant, 
not-concurrent-relatives,  disconnected,  not-disconnected,  topaction,  top.DA, 
concurrent-set-action,  hcall,  get-parent,  getJocation,  get-topaction, 
subaction,  concurrent-subactions-between,  locations-of-disconnected-ancestors, 

It,  gt,  equal,  copy 

%% 

%%  Definitions 
%% 

%%  Tags  for  the  levels  in  an  AID 

aid-tag  =  oneof[TOP  ACTION, SUB-ACTION, NEW-GUARD.CONC.SETACTION.DISC-ACTIONinull] 
TOPACTION  =  aid-tag$make-TOP.ACTION(nil) 

SUBACTION  =  aid-tag$make-SUBACTION(nil) 

NEW.GUARD  =  aid-tag$make.NEW.GUARD(nil) 

CONC.SETACTION  *  aid-tag$make.CONC.SETACTION(nil) 

DISCACTION  =  aid-tag$make.DISCACTION(nil) 

aidJevel  *  struct[tg :  aid-tag,  nm :  int] 
site  =  int 

rep  =  sequence[aidJevell 

own  TopActionID :  int  :=  0  %%%  ID  generator  for  TopActions 
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%% 

%%  Make  AIDs 
%% 


new  =  proc  (GID:int)  returns  (cvt) 

%  RETURNS  an  aid  for  a  non-nested  top  action  at  guardian  #  GID 
TopActionID  :=  TopActionID  + 1  %  increment  ID  generator 

return(rep$[aidJevel${tg:  NEW.GUARD,  nm:  GID}, 

aid.level${tg  :TOP-ACTION,  nm  :TopActionl  D})) 

end  new 

make-subaction  *  proc  (Parent:  cvt,  Step:  int)  returns  (cvt) 

%  Requires:  Step  bigger  than  the  Step  given  previously  by  the  same  action 
%  Effects:  Returns  an  aid  for  a  local  sub  action  as  the  Parent’s  child 
%  (Enumerated  with  Step) 

return(rep$addh(Parent,aidJevel${tg:SUB.ACTION,  nm:step})) 
end  make-subaction 

make.concurrent-jseLaction  =  proc  (Parent:  cvt,  Step:  int)  returns  (cvt) 

%  Requires:  Step  bigger  than  the  Step  given  previously  by  the  same  action 
%  Effects:  Returns  an  aid  for  a  concurrent-set  action  as  a  child  of  Parent 
return(rep$addh(Parent,aidJevel${tg:CONC-SET-ACTION,  nm:step})) 
end  make-concurrent-set-action 

make-nestedtop  *  proc  (Parent:  cvt,  Step:  int)  returns  (cvt) 

%  Requires:  Step  bigger  than  the  Step  given  previously  by  the  same  action 
%  Effects:  Returns  an  aid  for  a  nested  top  action  as  a  child  of  Parent 
retun(rep$addh(Parent,aid-level${tg:TOP-ACTION,  nrn.step})) 

^.nestedtop 

mak<  ■>»:  cvt,  step:  int)  returns  (cvt) 

bigger  than  the  Step  given  previously  by  the  same  action 
, '  an  aid  for  a  top  disconnected  action  as  a  child  of  Parent 
j  Parent,  aidJevel${tg:DISC.ACTION,  nm:step})) 

eno  * 

makeJical!  *  proc  (Parent:  cvt,  GiD:  int)  returns  (cvt) 

%  Requires:  Step  bigger  than  the  Step  given  previously  by  the  same  action 
%  Requires:  Parent  is  a  call-action 

%  Effects:  Returns  an  aid  for  a  handler  call  (at  site  GID)  as  a  child  of  Parent 
return(rep$addh(Parent,aid.level${tg:NEW.GUARD,  nm:GID})) 
end  make-hcall 
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%% 

%%  Check  relations  between  AIDs 
%% 


any  .descendant  =  proc  (Child,  Parent:  rep)  returns  (bool) 

%  Effects:  Returns  true  iff  Child  is  a  descendant  of  Parent  (or  Child  •=■  Parent) 
if  rep$size(Parent)  >  rep$size(Chi!d)  then  return(false)  end 
possible-parent :  rep rep$subseq(Child,1  ,rep$size(Parent)) 
return(rep$similar(possible.parent;Parent)) 
end  any-descendant 

nested.topactionJbetween  =  proc  (Child,  Parent:  rep)  returns  (bool) 

%  Requires:  Child  is  a  descendant  of  Parent 
%  Effects:  returns  true  iff  Child  belongs  to  some  nested  topaction  that 
%  was  created  by  a  descendant  of  Parent. 
for  index:int  In  int$from.to(rep$size(Parent)+1  ,rep$size(Child))  do 
if  Child[index].tg  =  TOP-ACTION  then  return(true)  end 
end 

return(false) 

end  nested-topactionJoetween 

descendant  =  proc  (Child,  Parent:  cvt)  returns  (bool) 

%  Effects:  Returns  true  iff  Child  is  a  descendant  of  Parent.  However,  if 
%  there  is  a  nested  topaction  that  is  an  ancestor  of  Child  and  a 
%  proper  descendant  of  Parent,  returns  false. 

retum(any.descandant(Child,  Parent)  cand'nested-topaction_between(Child,  Parent)) 
end  descendant 

non-descendant  =  proc  (Child,  Parent:  cvt)  returns  (bool) 

%  Effects:  Returns  false  iff  Child  is  a  descendant  of  Parent.  However,  if 
%  there  is  a  nested  topaction  that  is  an  ancestor  of  Child  and  a 
%  proper  descendant  of  Parent,  returns  true. 

return("descendant(up(Chi!d),up(Parent))) 
end  non.descendant 

proper.descendant  =  proc  (Child,  Parent:  cvt)  returns  (bool) 

%  Effects:  Returns  true  iff  Child  is  a  proper  descendant  of  Parent. 

%  (And  there  is  no  nested  topaction  between  the  two). 
return(descendant(up(Child),up(Parent))  cand  "rep$similar(Child, Parent)) 
end  proper-descendant 

not-concurrent-relatives  =  proc  (A1  ,A2:cvt)  returns  (bool) 

%  Effects:  Returns  true  iff  A1  and  A2  are  not  concurrent  relatives 
the-LCA :  rep  :=  down(LCA(up(A1),up(A2))) 
except  when  not-related:  return(true)  end 
return(  rep$top(the.LCA).tg '«  CONC.SET.ACTION ) 
end  not-concurrent-reiatives 
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%% 

%%  Check  the  type  of  an  AID 

%% 

disconnected  =  proc  (A:cvt)  returns  (bool) 

%  Effects:  Returns  true  iff  A  is  disconnected  (i.e.,  descendant  of  top  DA) 
for  level:  aid-level  in  rep$elements(A)  do 

if  level.tg  =  DISC-ACTION  then  return(true)  end 

end 

return(false) 
end  disconnected 

not-disconnected  =  proc  (A:cvt)  returns  (bool) 

%  Effects:  Returns  false  iff  A  is  disconnected  (i.e.,  descendant  of  top  DA) 

return("disconnected(up(A))) 

end  not-disconnected 

tcpaction  =  proc  (A:  cvt)  returns  (bool) 

%  Effects:  Returns  true  iff  A  is  a  [nestedj  top  action 
return(rep$top(A).tg  =  TOP-ACTION) 
end  topaction 

top-DA  «  proc  (A:  cvt)  returns  (bool) 

%  Effects:  Returns  true  iff  A  is  a  top  disconnected  action 
retum(rep$top(A).tg  =  DISC-ACTION) 
end  top.DA 

concurrent-set-action  =  proc  (A:cvt)  retums(bool) 

%  Effects:  Returns  true  iff  A  is  a  concurrent  set  action 
»eturn(rep$top(A).tg  =  CONC-SET-ACTION) 
end  concurrent-set-action 

subaction  =  proc  (A:cvt)  returns(bool) 

%  Effects:  Returns  true  if  A  is  a  regular  subaction  (and  nothing  else) 
return(rep$top(A).tg  =  SUB-ACTION) 
end  subaction 

hcall  =  proc  (A:cvt)  retums(booi) 

%  Effects:  Returns  true  iff  A  is  a  handler  call 
return(rep$top(A).tg  =  NEW.GUARD) 
end  hcall 
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%% 

%%  Miscellaneous 
%% 


concurrent-SubactionsJoetween  =  iter  (Parent,  Child:cvt)  yields  (cvt) 

%  Requires:  Parent  is  an  ancestor  of  Child 
%  Effects:  Yields  every  child  A  of  a  concurrent-set  action  such  that  A 
%  is  an  ancestor  of  Child  and  a  proper  descendant  of  Parent 
for  index:  int  in  int$from_to(rep$size(Parent)+1  ,rep$siz6(Chiid))  do 
if  Child[index-1].tg  =  CONC.SET.ACTION  then 
yield(rep$subseq(Child,1  .index)) 
end 
end 

end  concurrent-subactions-between 

get-parent  =  proc  (A:cvt)  returns(cvt) 

%  Effects:  Returns  A's  parent  (or  A  if  A  is  a  (non-nested)  topaction) 
if  rep$size(A)  =  2  then  return(A)  end 
return(rep$remh(A)) 
end  get-parent 

LCA  =  proc(A1 ,  A2:cvt)  returns  (cvt)  signals  (not-related) 

%  Effects:  Returns  the  AID  of  LCA(A  i,A2),  signals  not-related  if  needed 
if  A1  [1  ]  '=  A2[1  ]  cor  A1  [2]  '=  A2[2]  then  signal, not-reiated  end 
short :  rep 

if  rep$size(A1)  >  rep$size(A2)  then  short :«  A2  eise  short  :=  A1  end 
for  index:int  In  int$from.to(2,rep$size(short))  do 

if  A1  [index]  '=  A2[index]  then  return(rep$subseq(Al ,  1  ,index-1 ))  end 

end 

return(short) 
end  LCA 

getJocation  =  proc  (A:  cvt)  returns  (site) 

%  Effects:  Returns  the  site  of  action  A 

for  index:int  in  int$fromio_by(rep$size(A),11-1)  do 

if  A[index].tg  =  NEW.GUARD  then  return(A[index].nm)  end 
end  %  for 
end  getJocation 

get-topaction  =  proc  (A:  cvt)  returns  (cvt) 

%  Effects:  returns  the  immediate  (possibly  nested)  topaction  of  A 
for  index:int  in  int$from.to(rep$size(A),2)  do 

if  A[index].tg  =  TOP-ACTION  then  return(rep$subseq(A,1  .index))  end 

end 

end  get-topaction 
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locations.of-disconnected-ancestors  =  lter(A:  cvt)  yields  (site, cvt) 

%  Effects:  Yields  pairs  of  top  disconnected  ancestors  of  A  and  their  sites 
for  index:  int  in  int$?rom.to(1  ,rep$size(A))  do 
if  A[index].tg  =  DISC.ACTION  then 

temp-aid  :  rep  :=  rep$subseq(A,1  .index) 
yield(getJocation(up(temp.aid)),temp.aid) 
end 
end 

end  locations-of-disconnected-ancestors 

equal  =  proc  (A1 ,  A2:  cvt)  returns  (boo!) 

%  Effects:  Returns  true  iff  A1  and  A2  denote  the  same  action 

return(  A1  =  A2 ) 
end  equal 

It  =  proc  (A1 ,  A2:  cvt)  returns  (bool) 

%  Requires:  A 1  and  A2  belong  to  same  transaction 

%  Effects:  Creates  a  total  order  based  on  AIDs  enumeration.  (A1  <  A2) 

%  (Returns  true  if  A 1  was  created  before  A2,  meaningless  if  A 1  and 

%  A2  are  concurrent  siblings) 

return(gt(up(A2)  ,up(A1 ))) 

end  It 

gt  =  proc  (A1 ,  A2:  cvt)  returns  (bool) 

%  Requires:  A 1  and  A2  belong  to  same  transaction 

%  Effects:  Creates  a  total  order  based  on  AIDs  enumeration.  (A1  >  A2) 

%  (Returns  true  if  A 1  was  created  after  A2,  meaningless  if  A 1  and 

%  A2  are  concurrent  siblings) 

point-of-diff :  int  :=  rep$size(down(LCA(up(A1),up(A2))))  + 1 

min-size :  int  :=  int$min(rep$size(A1),rep$size(A2)) 

if  point-ofjdiff  =  min-size  + 1  then 

return(  rep$size(A1)  >  min_size ) 
else 

return(  A'i[point.ofjdiff].nm  >  A2[poinLof-diff].nm ) 

end 

endgt 

copy  =  proc  (A:cvt)  returns  (cvt) 
retum(rep$copy(A)) 
end  copy 


end  aid 
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Appendix  B 


Implementation  of  Timestamps 


The  following  possible  implementation  of  a  timestamp  is  compatible  with  its  use  in  this  thesis: 

%  Structure: 

%  timestamp  ->  <  num>  timestamp  \  <  num> 

% 

timestamp  =  cluster  is  new,  increment,  nest,  equal,  similar,  It,  gt,  max 
rep  =  sequence[int], 
new  =  proc()  returns(cvt) 

%  Effects:  Create  and  return  a  new  (zero)  timestamp 

return(rep$[0]) 

end  new 

increment  =  proc  (t:cvt)  returns(cvt) 

%  Effects:  Increments  the  current  level  by  1 
return(rep$addh(rep$remh(t),rep$top(t)  + 1)) 
end  increment 

nest  =  proc  (t:cvt)  returns(cvt) 

%  Effects:  Adds  a  new  level,  initially  0 

retum(rep$addh(t,0)) 

end  nest 

equal  =  proc  (tl  ,t2:cvt)  returns  (bool) 

%  Effects:  Returns  true  iff  two  timestamps  are  one 
return(t1  =  1 2) 
end  equal 

similar  =  proc  (tl  ,t2:cvt)  returns  (bool) 

%  Effects:  Returns  true  iff  two  timestamps  have  the  same  value 
return(rep$simiiar(t1  ,t2)) 
end  similar 
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It  =  proc  (tl  ,t2:cvt)  returns  (bool) 

%  Effects:  Returns  true  ifftl  <  t2 
for  index:int  in  rep$indexes(t1 )  do 

if  tl  [index]  >  t2[index]  then  return(false) 

elseif  tl  [index]  <  t2[index]  then  return(true) 
end  except  when  bounds:  return(false)  end 
end  %  for 
return(false) 
end  It 

gt  =  proc  (tl  ,t2:cvt)  returns  (bool) 

%  Effects:  returns  true  ifftl  >  t2 
return("lt(Rp(t1),up(t2))  cand  'equal(up(t1),up(t2))) 

end  gt 

max  =  proc  (tl  ,t2:cvt)  returns(cvt) 

%  Effects:  Returns  the  larger  of  two  timestamps 
if  It(up(t1),up(t2))  then  return(t2)  else  return(tl)  end 
end  max 

end  timestamp 
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